DRAFT


REPORT ON 


DOCUMENTING 
THE DIGITAL AGE 


A conference sponsored by the National Science Foundation, 
MCI Communications Corporation, Microsoft Corporation, 
and History Associates Incorporated 


James B. Gardner 
Project Director 
History Associates Incorporated 


CONTENTS 


Introduction 

Background 

Conference Program and Participants 

Position Papers, Responses, and Discussion 
Follow-Up Comments 

Findings 

Next Steps 

Appendix A: List of Participants and Affiliations 
Appendix B: Participant Biographies 

Appendix C: Conference Program 


Appendix D: Position Papers 


INTRODUCTION 


The National Science Foundation, MCI Communications Corporation, Microsoft 
Corporation, and History Associates Incorporated have joined together for a special 
initiative on "Documenting the Digital Age." At a conference in San Francisco on 
February 10-12, 1997, the four cosponsors brought together key figures from the 
academic, archival, corporate, government, legal, and technology communities to discuss 
how to go about archiving digital information and records. Although there has been and 
will continue to be substantial discussion on this subject in various other forums, this 
conference was unique in bringing together a wide range of public and private 
stakeholders to discuss priorities and establish partnerships for moving forward with this 
important work. The goals of the conference were both to launch an effort to preserve the 
history of the digital age and to broaden discussion about preserving history in the digital 
age, both against the background of fast-changing technologies.


BACKGROUND 


Since 1987, the Internet has grown exponentially, utilizing new technology to link
millions of individuals around the world in a network of networks and fostering a new
culture. As the Internet has developed and grown, not only the technology but also the political
and social issues and challenges surrounding its development have changed, yielding a
system today very different from that of only a few years ago. Unfortunately, we are losing the
record of that evolving Internet--every day Web sites are disconnected or news groups fold, too
often leaving no permanent record. Consider the irony, suggests Nathan Myhrvold, Chief
Technology Officer at Microsoft, of a "technology that lets us store digital information with
perfect fidelity and make it available to the world incredibly cheaply [but] does not naturally
produce a record as its byproduct. . . . The surprising truth is that the early days of the digital age
will appear almost pre-literate to future historians. . . . if we don't save it, it isn't part of the
historical record."


Some of that history is being preserved. The Internet Activities Board has, for
example, documented the protocols of the first packet-switching networks, and its Internet 
Experiment Notes and Requests for Comment serve as a key archive of technical information. 
Established in 1980 at the University of Minnesota, the Charles Babbage Institute has become a 
leading center for historical research and archival activity on the Internet, and it has been joined in 
recent years by a variety of other organizations and individuals, yielding a growing literature that 
includes published and unpublished studies and oral histories. Last year, Brewster Kahle 
established the Internet Archive, the first effort to document current Internet activity. The
problem is that we have some of the pieces but not all; we lack the systematic documentation 
necessary for a coherent history of the Internet that captures the activity and information, the 
structure and context. We have lost important records on the early development of the Internet 
under the Advanced Research Projects Agency of the Department of Defense and have failed to 
document adequately the Internet's growth and development. Consider, for example, how little 
documentation we have of the extensive discussions about commercial contributions to and 
activity on the Internet. Or consider the Web sites we visit today--they reveal nothing about the
early sites they replaced, making it difficult to assess how much the Web has changed and in what 
ways. It is one thing to recall the early days of the Internet or write a secondary account; it is 
quite another to capture the activity and information, the structure and context. That 
documentation will be lost forever without a collaborative effort of public and private
stakeholders.


But the problem is much larger than just the history of the Internet--there is also
history on the Internet and intranets. Consider the extent to which e-mail has replaced personal
correspondence, or how much of the work of business and government is handled through internal
electronic communication without a paper trail. Without swift and decisive action to preserve
digitized information in all forms, we are in danger of losing the documents and information 
necessary to write the history of our own times. 


In the spring of 1996, concerns over the vanishing history of the World Wide Web
surfaced in an e-mail discussion initiated by Mr. Myhrvold. Responding to the concerns raised,
Philip L. Cantelon, president of History Associates Incorporated, proposed convening a two-day 
conference of key players to begin broad cross-disciplinary discussion on archiving the Internet 
and establish guidelines for future action. At Dr. Cantelon's invitation, James B. Gardner agreed 
to work with HAI to develop a conference proposal. In the fall of 1996, MCI, Microsoft, and the
National Science Foundation agreed to provide the necessary funding, and Mr. Myhrvold and 
Vinton G. Cerf, Senior Vice President of Internet Architecture at MCI, agreed to serve as co- 
chairs. Although the original plan was to convene the conference in the fall of 1997, all quickly 
agreed that the urgency of the problem required more expeditious action and the conference was 
scheduled for February 1997. 


At the same time, the project principals decided to broaden the agenda to include not 
just the Internet and its history but broader concerns about preserving electronic records--what 
essential data and documents to preserve, how to preserve them, and how to make them accessible 
and usable. Invited position papers and conference discussion would address varieties of 
electronic information (e-mail, discussion groups, long-distance computing, and file transfers), 
complicating factors such as hardware and software obsolescence, and concerns about the 
adequacy of search tools and privacy protection. The conference would conclude with a 
discussion of the next steps to take in this important collaborative venture. To move discussion 
beyond the usual professional boundaries, the sponsors agreed to invite experts from the private 
and public sectors, specialists in archives, communications, digital technology, history, and the 
law. 


CONFERENCE PROGRAM 
AND PARTICIPANTS 


In consultation with Dr. Cerf and Mr. Myhrvold, Dr. Gardner developed the
program and invited individuals to develop eight position papers. Each was asked to write 
a paper of eight to ten pages, to be completed four weeks prior to the conference and 
made available to the other participants on the conference's Web page
(http://dtda.mci.com). Similarly, Dr. Gardner recruited appropriate individuals to chair the 
sessions and provide formal responses to the position papers. Every invited participant 
was asked to take part in the program in some capacity. 


The mix of participants was impressive, including not only historians and archivists 
but also corporate executives, engineers and technology experts, journalists, and legal 
experts. Participant affiliations included the Commission on Preservation and Access, the 
Computer Museum History Center, the Electronic Privacy Information Center, the 
International Association for Social Science Information Services and Technology, the 
Internet Archive, the National Archives and Records Administration, the Research 
Libraries Group, the Smithsonian's National Air and Space Museum and National Museum 
of American History, the Text Encoding Initiative, and the World Wide Web Consortium.
A complete list of participants and their affiliations can be found in appendix A and brief
individual bios in appendix B. 


The meeting began on Monday evening, February 10 with welcoming remarks 
from the project sponsors and a keynote address. The conference continued the next day 
with morning sessions on "Documenting the Digital Age: What Do We Know?" and 
"Capturing History Digitally: Why Archive the Internet?" followed by afternoon sessions 
on "Assessing the Need: What Information and Activities Should We Preserve?" and 
"Developing Strategies: How Do We Archive Digital Records?" Sessions on the final 
morning (Wednesday, February 12) addressed "What Are the Legal Complications in
Archiving Digital Records?" and "How Do We Make Electronic Archives Usable and
Accessible?" Each of the six sessions began with a brief (ten-minute) summary by the
paper author or authors, followed by both formal responses and open discussion by 
participants. The conference concluded with a wrap-up session on "Taking Action: What 
Do We Do Next? Who Will Take Responsibility? Who Will Provide Funding?" See 
appendix C for a copy of the program. 




POSITION PAPERS, RESPONSES, 
AND DISCUSSION 


The following are summaries of position papers, responses, and discussion, 
organized by day and topic. For the full text of the papers, see appendix D. 


Monday, February 10 
Opening Dinner, Welcome, and Keynote 


Philip L. Cantelon, History Associates Incorporated, opened the conference at 
7:30 p.m., welcoming the attendees and introducing Bert C. Roberts, Jr., MCI 
Communications Corporation, Vinton G. Cerf, MCI Communications Corporation, and 
Nathan Myhrvold, Microsoft Corporation, each of whom made brief welcoming remarks. 
Cantelon outlined the goals of the conference--to bring together individuals from archives,
engineering, history, law, and science to establish a "common path" for preserving both
the history of digital technology and digital information and records. Roberts followed,
voicing his support for that agenda and emphasizing the importance of this issue to MCI. 
To illustrate the need for documenting the digital age, Cerf then described a recent visit to 
the Library of Congress's Manuscript Division, where he had the opportunity to look at
Alexander Graham Bell's notebooks, realizing that future generations interested in the
development of digital technology would not have such opportunities if steps are not
taken now. Finally, Myhrvold reinforced that point, emphasizing the importance of 
historical perspective to his own work, the need to look back in order to look forward. 


After dinner, James M. Muldoon, Professor of History, Rutgers University- 
Camden, gave the conference keynote, "Communications Revolutions: Little History, 
Much Myth." Drawing on his work as a specialist in medieval church law, Muldoon 
described the communications revolution as a four-part movement, beginning with the 
adoption of parchment for copying manuscripts between the sixth and tenth centuries and 
continuing through the development of digital technology in the late twentieth century. 
Using myths about the printing revolution in the mid-fifteenth century as illustrations, 
Muldoon stressed the need for historians of the digital age to look at that technology 
within the larger historical context of the time, to avoid looking for a single "great 
inventor," to recognize that the historical process is not a linear story of achievement and 
success but includes dead ends, failures, and missed opportunities, and to look for what 
we may have lost in the development of this new technology. He concluded, "What is of
special interest is that by 1520, two generations after the printing press appeared on the
scene, Europeans were asking about where it came from, how it was developed, and what 
were its consequences. Even at that seemingly early date, the origins of printing were lost 
in the mists of time. Given the short length of a generation in the digital world, this 
conference is not being held any too soon." 


Tuesday, February 11 
Session: "Documenting the History of the Digital Age: What Do We Know?" 
Chair: Steven Hensen, Perkins Library, Duke University 
Paper: Vinton G. Cerf, MCI Communications Corporation 
John C. Klensin, MCI Communications Corporation 
Response: Gwen Bell, The Computer Museum History Center 


The first session of the conference focused on assessing what we know about the 
history of the digital age. Steven Hensen, Duke University, chair of the session, began by 
emphasizing the critical importance of provenance and context. Recalling Cerf's story 
about Bell's notebooks, he reminded the participants that a unique document such as that 
can only be understood within the larger context of the Bell collections and the history of 
his work. 


Cerf presented the paper he co-authored with John C. Klensin, MCI 
Communications Corporation, who was unable to attend the conference. Cerf began with 
an overview of the history of digital computing, including hardware, programming 
languages and operating systems, networking and protocols, and the Internet. He then 
turned to early efforts to document the development of digital technology, beginning with
the Request for Comment (RFC) series first issued in 1969 to document the development
of ARPANET and Internet host level protocol design. He contrasted early series
documents which captured much of the debates, choices, and tradeoffs with more recent 
RFCs which capture only polished results. Cerf also discussed other vehicles for 
information dissemination and documentation, including USENET, FTP archives, Gopher,
WAIS (Wide Area Information Service), and the World Wide Web. It is the last, 
according to Cerf, "which motivates some of the questions surrounding the capture of the 
history of the Digital Age." More specifically, how do we capture and preserve the 
constantly changing content and variety of media? While Cerf supported the idea of a 
repository that would stabilize content, he expressed concern that archiving would not,
because of intellectual property issues, include software and we would thereby lose the
experience and interaction that are integral to the Web. He also raised concerns about
cataloging and indexing non-text content and the challenge of archiving executable Web 
pages. He concluded with a cautionary note--that we recognize that archiving the public 
Internet would capture only a small part of a vastly larger digital world that encompasses 
corporate intranets and private electronic communication. 


Gwen Bell, the Computer Museum History Center, responded to Cerf's paper. 
She reviewed the basic principles of collecting and the need to address not only the "first" 
and the "classic" but also larger bodies of work, representative samples, and even failures 
and misdirections. She emphasized the importance of not depending on corporations to 
collect and archive this history and proposed that nothing less than five years old be 
collected, thus enabling us to separate the trendy from the trend-making. 


General discussion followed, ranging over various issues raised by the Cerf-Klensin
paper and Bell response. Several participants raised questions about the basic nature of a 
digital archive. Brewster Kahle, the Internet Archive, noted the difference between a time 
capsule of the past and an active archive focusing on current activity. Picking up on a 
point made by Bell, Roberts suggested that it might make sense not to archive current 
developments and activity but to wait, look back, and capture what is most significant, 
eliminating the "noise." To illustrate how daunting archiving the Internet is, Cerf noted
that the volume of information cycled through MCI alone in a week is 100 terabytes and
that is growing at 300 percent a year. Kenneth Thibodeau, Center for Electronic Records,
NARA, urged that archives of the digital age not be technocentric but encompass societal 
issues, including the role of changing society in the development and adoption of the 
technology and the impact of the technology on society. 


Another portion of the discussion followed up on Cerf’s concern about the 
usefulness of an archive that does not include the software necessary to experience the 
content. Hensen stressed how software and hardware obsolescence makes access to 
digital material problematic, and access is as important as preserving material. Cerf 
discussed new tools to transfer digital information from one medium to another, but 
participants raised concerns about the context that is lost in that transfer. 


The bulk of the discussion centered around how we make sure that what should be 
archived is, particularly in the corporate world. Margaret Hedstrom, University of 
Michigan, stressed that the development of a digital archive must begin with the creation
of the information or record or there will simply be nothing meaningful or useful left to
capture. Cerf mentioned new "over-the-shoulder" software that would help the individual
capture what he or she is doing. Marc Rotenberg, Electronic Privacy Information Center, 
suggested that the real problem may not be a reluctance to retain information or 
communications but a reluctance to record to begin with. Roberts focused the discussion 
on the particular problem faced by corporations. According to Roberts, corporations 
resist archiving for a number of reasons. While the need to cut noncritical expenditures 
plays a role, the real concern is about retaining files, electronic or paper, that could be 
used by the government or by opponents in litigation. Anne Van Camp, Research 
Libraries Group, added that many corporations simply do not see the value of archives and 
that there is no mandate for corporations to preserve their records. She argued that these 
records are part of our national cultural patrimony and that we need to determine how to 
save them without either being intrusive into corporate business or relying only on 
serendipity and good will. Roberts responded that corporations need some incentive to 
archive. Hedstrom asked about legal accountability as an incentive. Roberts agreed that 
that is an incentive for some documentation retention but noted that retaining only what is 
legally and financially required leaves out much of the documentation that historians will 
want. Myhrvold proposed cryptographically locking sensitive records, but Ronald Plesser, 
Piper and Marbury, responded that that would not eliminate the corporation's obligation to 
produce records in the event of litigation. 


Paul Ceruzzi, National Air and Space Museum, concluded the discussion with a 
reminder that, despite fears to the contrary, the telephone did not mark the end to the 
historical record and doubtless digital technology will not either. 


Session: "Capturing History Digitally: Why Archive the Internet?"
Chair: John McChesney, National Public Radio 
Paper: Nathan Myhrvold, Microsoft Corporation 
Response: Steven Lubar, National Museum of American History, Smithsonian 
Institution 


John McChesney, National Public Radio, the session chair, began by recounting his 
own personal interest in the past and his concerns about whether e-mail will be preserved 
in the same way that the correspondence of earlier generations was. 


Myhrvold began his presentation with a brief history of information, arguing that 
the computer provides a fundamentally different way of creating and storing information. 


According to Myhrvold, the pace of hardware and software innovation is such that in 
twenty years a computer will be able to do in thirty seconds what it takes a year to do 
now; in forty years, a computer will have the capacity to do in thirty seconds what today's
would take a million years to do. With the Internet, the use of the computer has shifted 
from writing to communication, providing more flexibility, reach, and power than in any 
other form of communication. But, Myhrvold argued, the Internet is "not naturally
archival"--the medium is routinely overwritten and obsolesces, there is no archiving 
infrastructure, and "the rapid pace of technological change has created a cultural 
expectation of rapid update to the ‘latest and greatest’ version of any sort of information." 
The Internet will not be archived without a concerted effort. At stake, he argued, is not 
just the history of digital technology but indeed of our times. Myhrvold proposed that we 
save all the Internet, arguing that we cannot anticipate what future generations will value 
and that we cannot afford not to save it all, given the high cost of selecting and editing and 
the low cost of storing. Moreover, in just a few years, technological progress in 
information retrieval will be such that we will be able to access and use it all, contrary to 
historians' and archivists' arguments that saving too much may be as bad as saving too
little. For him, the ideal archive would combine snapshots of the net with
incremental updates. While that would not capture all the activity on the Web, it is
probably the best that we can do. "Without deliberate archiving," Myhrvold asserted, "the
Internet information will be lost, and the historical record of the society based on it is
therefore fragmented and incomplete."
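

Myhrvold's two speed figures are mutually consistent if computing capacity doubles about once a year. That doubling rate is an inference made here for illustration, not a figure stated in the summary above, and the function names are hypothetical. A minimal Python sketch of the arithmetic, under that assumption:

```python
import math

# Assumption: a 365-day year, ignoring leap years.
SECONDS_PER_YEAR = 365 * 24 * 3600


def implied_doubling_time_years(work_seconds_now, work_seconds_later, horizon_years):
    """Doubling time (in years) implied by a stated speedup over a horizon.

    Assumes smooth exponential growth, which is an illustrative
    simplification rather than a claim made in the conference paper.
    """
    speedup = work_seconds_now / work_seconds_later
    doublings = math.log2(speedup)
    return horizon_years / doublings


def equivalent_work_years(task_seconds, horizon_years, doubling_time_years):
    """Years of present-day computing matched by task_seconds of future computing."""
    speedup = 2 ** (horizon_years / doubling_time_years)
    return task_seconds * speedup / SECONDS_PER_YEAR


# "In twenty years ... thirty seconds what it takes a year to do now."
doubling_years = implied_doubling_time_years(SECONDS_PER_YEAR, 30, 20)

# Apply the same growth rate to the forty-year claim.
forty_year_equivalent = equivalent_work_years(30, 40, doubling_years)

print(f"Implied doubling time: {doubling_years:.2f} years")
print(f"Thirty seconds in forty years: about {forty_year_equivalent:,.0f} present-day years")
```

Under this assumption the forty-year speedup is the square of the twenty-year one, which is why thirty seconds of future computing corresponds to roughly a million years of present-day computing.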


Steven Lubar, National Museum of American History, began his response by 
explaining that the role of historians and archivists is not so much to save but to choose, to 
know what to keep and what not to. In making an acquisition, an archivist or curator 
must take into account both costs and benefits--what it will cost to acquire, process, and 
store versus how the item will be used. But does that process and that logic apply to 
digital information? Lubar conceded that digital technology makes it easier and cheaper to 
archive or store everything but argued that the issue is more complex than that. He 
discussed ways of selecting (such as sampling) that are not as expensive and time- 
consuming as item by item selection and warned that the technical solutions Myhrvold 
predicts may not resolve problems of hardware and software obsolescence and media 
instability. Moreover, who would use this Internet archive? Historians might use it if the 
cost were low enough, but they might actually be more interested in an archive of e-mail 
or software or in documenting how the Internet was used rather than what was on it. In 
other words, archiving the Internet may not be the best use of scarce funds. What it all 
boils down to, concluded Lubar, is what is the most efficient way to spend archival dollars
to preserve the history and culture of these times in a way that will be most useful in the
future to those who find these times of interest. What are the costs? What are the 
benefits? 


The discussion that followed ranged from questions about whether the Internet 
should be the focus of archiving efforts to whether or how it can be done. Hedstrom 
questioned the assumption of the centrality of the Internet to research and information 
management, arguing that it is not yet dominant and must be viewed within a wider range 
of options. Myhrvold responded that his concern did not grow out of an assumption of 
greater value but from concern that this resource is being lost. Nevertheless, he claimed 
that the Internet is now the preferred vehicle for research or information retrieval, to 
which C. M. Sperberg-McQueen, University of Illinois, Chicago, replied that the ease of 
delivery associated with the Internet did not mean that it is a more reliable source of 
information. 


But assuming the value of archiving the Internet, what should be saved? Myhrvold 
suggested that the group consider three possibilities--capturing all of the Internet and 
addressing access and usability later; applying archival principles of selection on the front 
end and archiving only a portion of the Internet; or turning to the creators of digital
information to save their own work. Muldoon argued that saving the entire Internet
simply postponed the selection process necessary to bring order to material and make it
usable by scholars. Mark Luker, National Science Foundation, countered that saving only
a portion of the Internet risked losing what later generations might need to understand our
times. Moreover, he argued, money is an issue: you cannot justify the expense of
selection on the front end when technology makes it cheap and easy to store it all. But 
Rotenberg suggested that we need to keep in mind the difference between arguing for 
what is technologically possible and for what should be done to meet research and 
information needs. Cantelon suggested that, given the different problems of and 
opportunities for intellectual and physical control raised by digital information, traditional 
paper-based archiving principles may no longer apply. He argued, for example, that digital 
archiving requires more active pursuit of records, not simply the willingness of an 
institution to serve as the repository and that we may need to refocus our attention from 
the saving of documents to their creation. Cerf suggested that the answer may be 
software that makes it possible for the Internet to save itself. 


There was also discussion of what "all" means. Rotenberg worried that the
"snapshot" approach advocated by Myhrvold would not capture the dynamic character of
the Internet--how can you take a snapshot of something that is constantly changing?
Similarly, John Markoff, The New York Times, proposed that we think in different 
structural terms--saving a "dynamic ecology" or environment rather than archiving static 
records. Myhrvold conceded that his approach would not capture all the activity but 
insisted that it is the best that we can do now and a reasonable compromise. 


Hedstrom expressed doubt that the participants would ever reach agreement on 
which approach to take and argued that what was more important was making sense of 
whatever is archived. We need to document the context for whatever decisions are made, 
she maintained. While we recognize the Web or the Internet as a logical category around
which to establish an archive, that decision may not be readily understandable or self-
explanatory to future generations. We must document, she argued, the larger context of
late twentieth-century information technology and make clear the logic or intent behind
whatever we do.


Finally, Plesser expressed concern that the discussion seemed to be assuming a 
centralized archive of the Internet, a concept he found troubling. Given the decentralized
nature of the Internet itself, he argued for a decentralized archive.


Session: "Assessing the Need: What Information and Activities Should We
Preserve?"
Chair: John Markoff, New York Times 
Paper: Michael Miller, Records Management, National Archives and Records 
Administration 


Responses:
Paul Ceruzzi, National Air and Space Museum, Smithsonian Institution
Francis X. Blouin, Bentley Historical Library, University of Michigan
Elizabeth Stephenson, Institute for Social Science Research, UCLA, and
the International Association for Social Science Information Services
and Technology


Broadening the agenda to address digital records as well as information, Michael 
Miller, NARA, focused on three issues: what should be preserved, what preservation 
entails, and who should do it. Deciding what to preserve involves determining value, 
asking why something should be preserved. As an archivist, he advocates applying 
archival selection principles, which focus not on predicting what future generations will
want or need but on documenting an activity or entity. He argued that we should not
confuse the electronic medium with the actual content or information and that we should 
keep in mind the various types of documentation, including public information, 
documents, administrative data, software, explanatory material, source materials, product 
development materials, data management information, and informal electronic 
communications. In developing a preservation strategy, the key issues are how much 
documentation to keep, for how long, in what format or at what level of functionality (for 
information or as an official record), and for what purpose. Explaining that there is no 
"one size fits all" solution, Miller argued that these questions must be addressed in
establishing a digital archive and that archivists are prepared to deal with them. That 
answered part of the question of who "we" are, but Miller argued for broader participation 
(including records creators) and for a decentralized approach that involves multiple, more
specialized archives. In sum, he argued for limiting the amount of documentation retained 
and for being realistic about what we can manage in terms of preservation and access. 


Three individuals provided formal responses to Miller's presentation. The first,
Francis X. Blouin, University of Michigan, suggested adding an organizational dimension
to Miller's types of documents in order to grasp more fully the complexity of developing
documentation strategies and adding "for whom" to his list of key issues. He expressed
concern about the problem of motivating corporations in particular to preserve their
records and suggested the need for a more compatible legal environment. Finally, he
stressed the need to document not only digital information and records but also the 
complex development of the information industry, although he added that the history of 
the digital age would be written even if the corporate records do not survive--the issue is 
how full and accurate that history will be. 


Ceruzzi was the second respondent. He suggested that archival selection is a 
difficult task, muddied by richness and ambiguity. He argued that it is not always possible 
to know the value of a document and that sticking to hard and fast rules of selection may 
result in missing important pieces. He concluded by asking whether we should create a 
digital archive. His response was that, if it is technically possible, it will be done. Rather 
than debating the wisdom of an electronic archive, we need to be working to shape its 
direction. 


The third respondent was Elizabeth Stephenson, UCLA, who began by describing 
her work with data files and how they differ from the text files on which most of the 
discussion focused. She then raised a series of concerns--about the importance of
migrating data to maintain current access, the costs of preservation in terms of time and
money, the need to recognize the varieties of archives and purposes, the necessity and 
complications of addressing these issues globally, and the possibility of digital archives 
being of value in the larger economy. Finally, she critiqued the "snapshot" approach from 
the standpoint of data files, which are more "free flowing" than text files and harder to
capture and preserve.


In the discussion, the participants further explored the three issues Miller raised-- 
what should be saved, how it should be preserved, and who should do it. Roberts opened 
the discussion by raising concerns about ensuring the integrity of digital information or
records, and Jim Miller, the World Wide Web Consortium, noted work on developing a 
form of digital signature. Cerf emphasized that we must recognize that we are not just 
talking about text or data or even images and sound but also, for example, more complex 
and ephemeral Web pages that do not exist apart from interaction with the user. 
Following up, Michael Miller reiterated his concern that more than just bits be saved, that 
the software and hardware emulations must be available for future generations to 
understand how we used the data and the limits and restraints under which we worked. 
Moreover, we need to be concerned about not just public and government but also 
corporate information and records, particularly intranet communications which are not 
included in Internet archiving plans. Participants considered various options to encourage 
corporations to save this information, including encryption and depositing in trusts that 
would isolate it from government intrusion. But the legal specialists questioned whether 
such corporate records could indeed be legally withheld, and Roberts concluded that there 
would have to be a change in the government context before corporations would consider 
setting up intranet archives. 


Thibodeau suggested that we should consider archiving the other side of digital 
communication--the usage logs that can help us understand not just what was available but 
what people used. Apparently that would be technologically possible, but would it be 
legal? Plesser warned the group that tracking individual use would lead to enormous legal 
problems in terms of invasion of privacy. While recognizing the legitimacy of such 
concerns, Thibodeau insisted that it might be possible to save valuable usage records and 
protect individual privacy using well-established models from archival management. 


Discussion also addressed the issue of how such material should be preserved. 
Rotenberg reminded the participants that electronic distribution is itself a form of 
preservation, ensuring the existence of multiple copies that far exceed anything possible 
with paper. Myhrvold suggested that existing indexing efforts such as AltaVista might 
provide a sufficient archive for a substantial portion of the Web, but Hensen emphasized 
that indexing and archiving are not the same thing. Archiving provides more than just 
keyword access; it provides the metadata essential for understanding the context. Blouin 
reinforced Hensen's comments, emphasizing that provenance and context must be part of 
any efforts to document the digital age. 


In discussing who should take responsibility, Hedstrom rejected the idea of "one 
archive" and called for a more complex distributed system of archives. Restating his 
argument that different records and information should be preserved at different levels of 
functionality and for different lengths of time, Michael Miller argued that even with a 
distributed archival system the question remains of who will save one copy for the long 
term. 


Session: "Developing Strategies: How Do We Archive Digital Records?" 
Chair: Birgit Ireland, History Associates Incorporated 
Papers: 
Brewster Kahle, Internet Archive 
Donald Waters, Yale University Library and the Task Force on Archiving 
Digital Information 
Response: Adam L. Gruen, MCI Corporate Archives 


Donald Waters, Yale University, centered his presentation on a report from the 
Task Force on Archiving Digital Information, which he co-chaired for the Commission on 
Preservation and Access and the Research Libraries Group. He began by noting the Task 
Force's focus on the larger issue of digital information rather than records and by defining 
a digital archive as a body that can ensure the integrity and long-term accessibility of 
culturally significant knowledge in digital form. He then turned to four questions--why 
save digital information, what should be saved, how it should be saved, and how do we 
get started. In regard to why, the Task Force argued that new knowledge builds on the 
old and that it is impossible to build a knowledge-based economy without archives. The 
"what" is "culturally significant information," which different stakeholders define in 
different ways. Integrity is central, and the features that determine that are content, fixity, 
reference, provenance, and context. When it comes to how to save digital information, 
Waters argued for carefully managing costs and finances within an operating environment 
that addresses appraisal and selection, accessioning, storage and access, and migration 
strategies. The last include building hardware and software emulators, changing and 
refreshing media, changing formats, and working to build migration paths into hardware 
and software. He proposed beginning the building of digital archives through pilot 
projects, the development of support structures, and the identification of best practices. 
Finally, Waters called for multiple, distributed archives with common systems for retrieval, 
balancing redundancy with economies of scale. 
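
The "fixity" feature and the media-refreshing strategy mentioned above can be illustrated with a short sketch. This is not drawn from the Task Force report; it is a minimal modern illustration that assumes a checksum (here SHA-256) is recorded when a record is accessioned and checked again whenever the record is copied to new media. The function names are hypothetical.

```python
import hashlib


def fixity_digest(data: bytes) -> str:
    """Return a SHA-256 digest to be stored alongside the record at accession."""
    return hashlib.sha256(data).hexdigest()


def verify_fixity(data: bytes, recorded_digest: str) -> bool:
    """Check that the bits still match the digest recorded earlier."""
    return fixity_digest(data) == recorded_digest


def refresh_copy(data: bytes, recorded_digest: str) -> bytes:
    """Simulate copying a record to new media, refusing a copy that has changed."""
    new_copy = bytes(data)  # stand-in for the actual media-to-media copy
    if not verify_fixity(new_copy, recorded_digest):
        raise ValueError("copy does not match the recorded digest")
    return new_copy
```

A check of this kind addresses only whether the bits are unchanged; it says nothing about provenance or context, which the paper treats as separate features of integrity.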


Kahle's presentation focused on the Internet Archive and the issues and concerns 
that arose in its development. The mission of the Internet Archive is to gather and store 
now all public information on the Internet (World Wide Web, NetNews, Gopher, and 
FTP) and in the future offer access. Kahle's goal is to provide an Internet resource 
distinguished by reliability, accountability, and durability. Thus far, the most difficult 
complications he has encountered involve protecting privacy and copyright and dealing 
with on-line pornography. In terms of the technology, the key challenges are gathering 
the data, storing it, and providing access. As of February 1997, the Archive had gathered 
2 terabytes of data or information, primarily from the World Wide Web, ranging from 
HTML files only for some Web sites to full archiving of others. Kahle has studied the costs 
and retrieval times for a variety of storage options and is using a combination of hard disk 
storage for more frequently accessed data and tape jukeboxes for the larger less used 
portion. Access remains an issue to address in the future, but Kahle's goal is to provide 
search and other services that will facilitate access to and enhance the value of this 
information resource. [That evening, the conference participants visited Kahle's offices at 
the Presidio, where staff gave a demonstration of the Internet Archive.] 


Adam Gruen, MCI Corporate Archives, responded to the two position papers. He 
began with a basic proposition--that, whatever we do, our basic strategy must be to match 
services to customer requirements, not to what we want. In regard to Kahle's 
presentation, Gruen urged multiple such ventures, arguing that the likely redundancy 
would be useful in reducing overall risk. Noting that the same databases can be saved in 
different ways depending on who needs the information and why, he emphasized that cost 
must be the determining factor and that, in order to make them profitable, we must make 
sure such archives meet customer needs. Continuing with this argument, Gruen disagreed 
with Waters' thesis that we function in a knowledge-based economy and proposed instead 
a "happiness-based economy," where knowledge is not pursued as an end in itself but 
rather as a tool to achieve other goals. He agreed, however, with Waters' emphasis on the 
initial responsibility of the creators for preserving information or records and called for a 
broad mix of public and private responsibility rather than government funding for 
developing digital archives. 


As chair of the session, Birgit Ireland, History Associates Incorporated, provided 
additional comments, urging the participants to think more broadly about digital 
information, to move beyond Kahle's focus on the Internet and beyond Waters' "culturally 
significant" information. She also raised concerns about migration strategies and asked 
about the applicability of Kahle's technology to intranet communication. 


In the discussion that followed, Kahle emphasized that his project should be seen 
as a pilot effort saving only a small fraction of digital information and records and not 
covering personal and corporate communications. Cerf raised concerns regarding the 
long-term referenceability of URLs, which Jim Miller argued would not be a problem. 
Hedstrom asked Kahle whether he is documenting his own work--how the Internet 
Archive has been assembled and what universe or portion thereof it represents. Kahle 
responded that they were including metadata. Plesser inquired whether government 
material in the public domain was being included, and Kahle responded that it was 
possible. Michael Miller noted that the government needs to archive what it puts on the 
Web, providing the documentation necessary for accountability, whether paper or 
electronic. Stephenson urged the development of search engines sooner rather than later. 
Finally, Van Camp noted that the Internet Archive could be seen as an example of 
aggressive rescue of endangered material, which falls within the recommendations of 
Waters' Task Force. 


Wednesday, February 12 
Session: "What Are the Legal Complications in Archiving Digital Records?" 
Chair: Ronald Plesser, Piper and Marbury 
Papers: 
Trotter Hardy, William and Mary School of Law 
Marc Rotenberg, Electronic Privacy Information Center 
Response: Anne Van Camp, Research Libraries Group 


Trotter Hardy, William and Mary School of Law, began his presentation by 
reviewing the basics of copyright law, including what can be copyrighted, how copyright 
is secured and what rights an owner has, and what constitutes infringement. He then 
turned to the application of copyright statutes and case law to archives and libraries, 
focusing on copying provisions and limits under section 108 of the Copyright Act. Noting 
that the law focuses on paper and textual materials and does not deal with digital material 
or graphics, audiovisuals, sound, and the like, he then speculated on the applicability to 
digital archives. Hardy argued that digital archiving projects should assume that the basic 
provisions of the Copyright Act apply and work within that context, focusing on archive 
copying of textual materials only. He warned that digital copying of any printed textual 
materials without permission is not authorized, given the prohibition on conversion into 
searchable text, but suggested that a digital copy of a digital text would be authorized 
within the limits of the act regarding use. The situation is even less clear in regard to 
online digital archives, and Hardy indicated that online access would probably be viewed 
as constituting making and distributing a copy of a work and that it is unclear how 
provisions regarding copying excerpts for research purposes or whole works when copies 
cannot otherwise be bought would apply in the more complicated digital environment. His 
assessment was that almost all digital materials are copyrighted and almost any use is 
infringement unless permitted by the owner or by statute. To minimize liability for 
damages, he advised focusing archiving on research purposes, confining archives to 
textual materials only and sites not still available, and being prepared to omit or delete any 
record or information if the copyright owner objects. His best advice was to secure 
permission before assembling a digital archive and to remember that distribution without 
permission or even allowing browsing could constitute infringement of copyright. 


Rotenberg, in the second presentation, began by defining privacy as a legal 
concept, focusing on claims against the collection, use, maintenance, or disclosure of 
personally identifiable information. He reviewed evidence of the importance of privacy to 
Internet users and stressed the role of privacy and anonymity in facilitating 
communications and information retrieval. While intrusions into private life can be 
justified under certain circumstances, ongoing or routine surveillance is more problematic 
and must be avoided in digital archiving projects. Rotenberg concluded that the debate 
over protection and privacy should not be "either/or"; rather, our goal should be both 
higher public access to information and privacy protection--"a public archive that respects 
private life." 


Van Camp noted that much of what Hardy and Rotenberg said was not new to 
archivists but that it was important that it be said by authorities on the issues. She 
expressed concern that information policy is too piecemeal and frustrating, reflecting the 
lack of voice of archivists and other professionals in policy-making. Indeed, what exists is 
largely case law rather than guidelines or regulations. And there is even less direction for 
preserving information outside of government. Referring to the Task Force report 
summarized by Waters earlier, she called for legal structures and mandates to protect 
cultural properties at risk. 


As chair, Plesser noted the challenge of saving both public and private information 
and records, emphasizing that the law affects different actors in different ways. The 
discussion that followed focused largely on the issue of access. Plesser argued that 
providing access to data is more problematic than how that data is actually used, and 
Rotenberg noted, for example, that the press has more latitude than the data holder in 
regard to privacy issues. Stephenson asked about the responsibility or liability of archives 
which preserve or provide electronic access to survey data files containing sensitive 
information. While the anonymity of the individuals surveyed is supposedly protected, 
there might be some way to manipulate the data and identify individuals. Cerf noted that 
the inferential identification of individuals is sometimes possible. Plesser responded that 
archives have the same obligation in regard to electronic files as to paper--to maintain 
appropriate standards of care in controlling access and protecting subjects. Archives must 
make sure that they are abiding by the terms of any contracts or agreements with 
individual subjects regarding anonymity and access to data. Archives should also keep in 
mind that the retrievability of individual records or information may trigger privacy 
concerns even for records normally viewed as public. Also, the kind of information would 
be an issue--would its release be highly offensive to a reasonable person? Finally, Plesser 
noted that servers may share liability, since they facilitate the distribution of information. 


Session: "How Do We Make Electronic Archives Usable and Accessible?" 
Chair: Jim Miller, Massachusetts Institute of Technology and the World Wide 
Web Consortium 
Paper: Margaret Hedstrom, School of Information, University of 
Michigan 
Response: 
C. M. Sperberg-McQueen, University of Illinois, Chicago, and the Text 
Encoding Initiative 
Bill Barnes, Slate Magazine, Microsoft Corporation 


Hedstrom began by noting that electronic archives already exist--the first were set 
up as long as thirty years ago, so some archivists have experience in the area. But there 
are problems, including the lack of integrated access systems, the length of time required 
to retrieve often off-line media, and the high cost in time and money for researchers and 
repositories. In terms of accessibility, she argued that the context should be user needs 
but that archives have not undertaken the systematic and comprehensive analysis of users 
and their needs that such an approach requires. Hedstrom proposed access systems that 
will span a widely distributed system of electronic archives--not a single repository--and 
make it possible for an individual to find the information or records sought regardless of 
the location of the repository. She suggested building on archives’ experience with access 
systems such as the Research Libraries Information Network (RLIN) and on current 
efforts to develop a standard for Encoded Archival Description using SGML and argued 
for tying electronic archives access systems into existing access systems for other 
materials. In order for electronic archives to be usable, according to Hedstrom, we must 
have contextual information--archives are not self-explanatory or self-authenticating. 
Information about the origin, creation, purpose, use, and chain of custody is essential for 
evaluating and interpreting archival records, and she described a metadata model 
developed at the University of Pittsburgh. She argued that the biggest problem may be 
maintaining access to the software necessary to open, view, and analyze electronic 
materials and proposed distributed responsibility, with access to software or emulators 
available on-line. Moreover, preserving software is important to do in any case as part of 
the history of the digital age. In regard to privacy issues, she suggested that electronic 
archives should draw upon the experiences of other archival repositories in developing 
access policies. Finally, she returned to the user, calling for new strategies and tools to 
facilitate individual record keeping and new initiatives to educate users to evaluate and 
interpret electronic evidence. 


In his response, Sperberg-McQueen emphasized the importance of Hedstrom's call 
for multiple electronic archives rather than a single archive--this must be a distributed 
activity. But there must also be integrated, unified access, something between high 
relevance, low retrieval systems such as RLIN and high retrieval, low relevance systems 
such as AltaVista. The key, he argued, is standardization of descriptive information using, 
for example, Dublin Core Metadata. He agreed with Hedstrom's emphasis on reliability, 
authenticity, and usability and her concerns about software, although warning that 
emulators would have to be tested and certified. Ultimately, however, he maintained that 
archives will have to migrate data and that those employing standards such as HTML will 
survive better than others. 
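
To make the idea of standardized description across distributed archives concrete, the following is a minimal sketch rather than anything presented at the conference. It assumes each repository describes its holdings using the fifteen element names of the later Dublin Core Metadata Element Set, so that one search can span several catalogs. The helper names `make_record` and `search`, and the sample repositories, are hypothetical.

```python
# Element names of the fifteen-element Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = frozenset({
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
})


def make_record(repository: str, **elements: str) -> dict:
    """Build a descriptive record, rejecting element names outside the shared set."""
    unknown = set(elements) - DUBLIN_CORE_ELEMENTS
    if unknown:
        raise ValueError(f"not a Dublin Core element: {sorted(unknown)}")
    return {"repository": repository, **elements}


def search(catalogs: list, element: str, term: str) -> list:
    """Search one element across every repository's catalog, ignoring case."""
    term = term.lower()
    return [
        record
        for catalog in catalogs
        for record in catalog
        if term in record.get(element, "").lower()
    ]
```

Because every repository uses the same element names, the search does not need to know where a record is held, which is the kind of unified access over a distributed system that the response describes.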


Bill Barnes, Microsoft Corporation, provided a second response. He raised 
concerns about preserving snapshots of the Internet without the interactive experience 
associated with the sites and suggested that we might need to consider a "simuline 
archive." He also supported self-archiving and described efforts at Slate Magazine to 
provide continuing access to past publications. 


As chair, Jim Miller began the discussion by urging compromise, warning that if 
archivists push for too much metadata they might end up with none. He also urged that all 
concerned keep in mind the responsibility for "perpetual care," for maintaining long-term 
access. Kahle suggested that access is less an issue than usability, but Hedstrom 
countered that too much accessibility could make usability problematic. Thibodeau 
expressed concern that authentication is called into question when others are able to 
access and use data for their own purposes. 


The participants also discussed further the idea of self archiving raised by Barnes. 
Roberts indicated that that might work within corporations but that it would require clear 
standards and some system of certification. Lubar also raised questions about 
self-archiving, asking what should be built in, in terms of contextual and other information. 
Hensen suggested that self archiving should also include a means of linking archives to the 
work produced from them. 


Finally, Roberts suggested that maybe the best place to start with electronic 
archiving is now, moving forward rather than trying to recapture the past, and Hedstrom 
agreed. 


Session: "Taking Action: What Do We Do Next? Who Will Take Responsibility? 
Who Will Provide Funding?" 


Chair/discussion leader: Philip L. Cantelon, History Associates Incorporated 
Comments: 
Vinton G. Cerf, MCI Communications Corporation 
Mark Luker, NSFNet Program, National Science Foundation 
Kenneth F. Thibodeau, Center for Electronic Records, National Archives 
and Records Administration 


Cerf spoke first in this concluding session. He discerned five critical issues or 
concerns in developing digital archives. First, that there are many legal and social 
potholes that must be addressed. Second, that other products and services, such as 
backups and indexing tools, can contribute to this initiative. Third, that we must begin 
addressing international policy issues. Fourth, that authentication of digital archives is 
critical. Fifth, that we have not yet come up with a satisfactory definition of what a 
"work" is and how we determine authorship. 


Luker then urged that the focus shift from the Internet to the larger issue of 
documenting the digital age but predicted that this will entail higher costs. Looking at who 
might pay for this, he argued that we need to emphasize both current and future value. 
That means, for example, taking steps to ensure that information providers have the 
opportunities to take advantage of market interest in their services. We need to enlist 
stakeholders, including the government, in advocating policies and laws that will facilitate 
action and reduce barriers. Luker also discussed the role of NSF both in the development 
of Internet 1 and in planning for the Next Generation Internet (integrating federal research 
networks and industry) and Internet 2 (being organized by universities). These two new 
initiatives not only promise more reliable access on a greater scale but also provide 
important opportunities to build on lessons learned and address continuing problems. 


Thibodeau was the third speaker. He urged taking an archival perspective, 
focusing on customers that are generations away and enabling the future to understand its 
past. That requires more than just information; documenting our age requires saving not 
only records but the historical context within which they were created. Without that 
context, we have nothing more than a garbage heap of information. According to 
Thibodeau, "What we need is as rich a record as we can afford." So, who will pay for 
this? Noting that the National Archives and Records Administration invests only token 
amounts in electronic records (about 2 percent of its annual budget), he suggested that the 
resources will come through the efforts of dedicated champions, determined to see this 
through to fruition. 


Finally, Cantelon asked participants to consider how we should move forward. 
For example, should this effort be divided up into separate issues such as international 
activity, the Internet, records, and advocacy and legal issues? He charged each to write to 
him after the conference with concrete suggestions for next steps and commitments to 
work on specific initiatives and urged all to take the discussion to larger public and 
professional forums. 


FOLLOW-UP COMMENTS 


In response to Cantelon's charge, fifteen of the participants sent written comments 
regarding their assessment of the issues and their suggestions for future directions. The 
following summarizes those responses. 


A number of the participants urged further discussion about what should be 
archived, emphasizing the difference between the history of the digital age and history in 
the digital age, with particular interest in expanding the agenda to encompass the wider 
issue of electronic records. There was also continuing discussion of the debate over 
saving all of the Internet and saving only a portion, between saving a snapshot of the 
Internet and trying to capture the continuum of Internet activity. Rotenberg urged that we 
move beyond "either/or" debates and start focusing on the positive reasons for digital 
archives. Ceruzzi suggested that the issue is probably moot anyway, given that efforts to 
save "all" are already underway, and Lubar concluded that we have probably lost much of 
the documentation of the early years and should not try to recapture it. Hedstrom, 
however, argued for stepping back and undertaking user studies to determine what users 
will actually need rather than what we think they will want. 


Several participants also brought up points regarding how to go about archiving 
digital material. The archivists remain concerned that basic archival techniques of 
appraisal, selection, preservation, and description are not being employed and that critical 
access issues are not being addressed on the front end of projects. Hedstrom argued for 
attention to user needs and for a more careful assessment of costs and tradeoffs entailed in 
pursuing different strategies, and Van Camp stressed the importance of developing 
integrated access systems. Several participants also explored further possibilities of self 
archiving, from Hedstrom's suggestion of more attention to personal record keeping to 
suggestions by Hardy and Lubar that the PICS system might provide an "archive/not 
archive" and/or metadata options for the creator of a record or document. Ceruzzi, 
Lubar, and Stephenson urged further investigation of the possibility of imbedding 
archiving functions in software, browsers, or even the Web. Ceruzzi's concern was that 
otherwise, as nonarchivists undertake these projects, archival concerns will not be 
addressed. Stephenson had specific proposals, suggesting that standardized file structure, 
version control, specific formats, and long-term storage capabilities could be imbedded 
into computing and communications technology. Several participants noted the need for 
further work on archival standards, particularly in regard to access and description, and 
Lubar argued that HTML and SGML formats will make archiving possible. Finally, 
Hedstrom raised concerns about migration and software obsolescence. 


There was also considerable discussion about how to resolve legal problems regarding 
privacy and copyright and the further complications when archiving moves beyond the 
U.S. Rotenberg warned that the international arena will present more hurdles, but 
Stephenson suggested looking to the European experience for possible solutions 
(particularly the International Federation of Data Organization and the European Social 
Science Data Associations). Plesser expressed continuing skepticism about the likelihood 
of gaining access to corporate records, and Hedstrom suggested seeking a legislative 
remedy through an exception for "archiving purposes." 


Several respondents expressed their strong support for the Internet Archive and 
suggested ways to help, including setting up archival procedures and sharing information 
about it with archivists and other interested communities. 


FINDINGS 


The following issues emerged from conference sessions and discussion and 
post-conference comments. Many involve questions about direction, reflecting the lack of 
consensus among the varied stakeholders and the complexity of the task of archiving 
digital material. These points are grouped under three larger questions which structured 
the conference. 


What should be saved? 


Digital materials alone will be insufficient to document the late twentieth century 
and should be viewed as part of a larger documentation strategy. We should not 
focus on the medium to the exclusion of larger documentation issues. 

Digital archive projects must save both the history of the digital age and history in 
the digital age, both the history of the Internet and the history on the Internet. And 
the context must be global. 

Archiving the digital age involves making difficult choices, each involving different 
costs, benefits, and tradeoffs--between trying to recapture the early history and 
beginning with what we have now, between saving all the Web and selecting and 
archiving only parts of it, between capturing a snapshot of the Web and trying to 
preserve its characteristic activity and changing nature. 

Archiving the Internet is only part of the challenge--documenting the digital age 
requires documenting not only public communication but also private and 
corporate. Archiving corporate records is particularly critical but significantly 
hindered by disincentives and obstacles in the current legal system. And the issue 
is not just text and data but also images, sound, executable pages, and other 
formats. 

What can be saved or archived is significantly limited by legal and privacy 
concerns. Until there is legislative change, the only safe ground is to focus on 
copying or saving digital texts for research purposes, with online access and the 
archiving of nontextual materials (images, sound, etc.) highly problematic. All 
these issues become even more problematic when viewed in an international 
context. 

Decisions about what to save or archive must take costs into account and address 
the value to and needs of the larger society and economy. 


Who should take responsibility? 


Multiple, distributed archives are preferable to a single, centralized repository, 
minimizing risk through redundancy and addressing the varied interests and needs 
of stakeholders. 

The creators of information and records should also bear responsibility, and this 
can best be achieved through imbedding a self archiving function in software and 
services. 


How should digital material be saved or archived? 


Digital archiving efforts can be built on backup and indexing services already in 
operation, but those do not constitute an alternative to developing a 
documentation strategy. 

Although digital technology makes it cheaper and easier to save everything, that 
approach does not address archival concerns about preserving contextual 
information. Decisions about saving or archiving digital information must address 
the costs, benefits, and tradeoffs of these alternative strategies. 

Digital archiving involves not just saving data bits but also preserving the 
functionality of information and records. Software and hardware constitute part of 
the context and must be saved as well. 

The instability of electronic content makes authenticity and integrity particularly 
problematic, and possible solutions include digital signatures and imbedding 
metadata in the document or record. 

Long term access and usability must be addressed as well through migration 
strategies and archival standards. 


NEXT STEPS 


The consensus of the conference participants seems to be that, while issues 
may need to be addressed separately (such as archiving the Internet versus the larger issue 
of electronic records), the mix of individuals and interests should be maintained. To move 
forward, we probably should broaden the participation to include more representatives of 
software and hardware developers and manufacturers, who can help address some of the 
"imbedding" proposals and whose visibility and influence can be employed in advocacy 
efforts. We should also expand the involvement of organizations already represented 
(such as the International Association of Social Science Information Service and 
Technology and the Research Libraries Group) and share the work of our group with the 
larger professional communities within which we work, including newsletters and 
listserves focusing on records management (ARMA and the National Association of 
Government Archivists and Records Administrators), archives (the Society of American 
Archivists), and electronic records; related projects and offices such as the Applications 
Council of the National Science and Technology Council (Committee on Computing, 
Information, and Communications), NIH's National Center for Research Resources, NSF's 
Digital Libraries initiative, the President's Government Information Technology Services 
Board; and existing archives such as the Charles Babbage Institute, the Silicon Valley 
Archives (Stanford), and the Business Records Centre (Glasgow). Hedstrom has 
proposed sharing a forthcoming report from a conference she convened in 1996 on 
research issues, and it will be essential that we coordinate our efforts with those of other 
projects such as Hedstrom's and, in particular, the CPA/RLG Task Force on Archiving 
Digital Information. Specific suggestions regarding the latter include pursuing the 
recommendation for an "aggressive rescue function" (Lubar) and contacting the RLG 
working group assessing United Kingdom and Australian responses (Van Camp). 


To follow up on the conference, History Associates has secured support from 
the Commission on Preservation and Access (CPA) to bring together authors of major 
reports and studies (including those sponsored by CPA/RLG and the University of 
Michigan) and representatives of federal funding agencies (NEH, NHPRC, and NSF), key 
organizations (CPA and RLG), and other stakeholders (such as the Department of 
Defense, MCI Communications Corporation, and the National Archives and Records 
Administration). The purpose of the one-day meeting, scheduled for May 23, 1997, is to 
review the status of efforts to archive digital information and records, explore how that 
work can be coordinated better, discuss funding possibilities, and develop a national 
agenda for archiving digital information and records. 




APPENDIX A 
LIST OF PARTICIPANTS 
AND AFFILIATIONS 


Bill Barnes 

Program Manager, Slate Magazine 
Microsoft Corporation 

One Microsoft Way 

Redmond, WA 98052 

Phone: 206/936-4893 

Fax: 206/703-3128 

E-mail: billba@microsoft.com 


Gwen Bell 
Director of Collections 
The Computer Museum History Center 

P.O. Box 3038 

Stanford, CA 94309-3038 
Phone: 408/562-7915 
Fax: 408/562-7915 
E-mail: bell@tcm.org 


Francis X. Blouin 

Director 

Bentley Historical Library 
University of Michigan 
1150 Beal Avenue 

Ann Arbor, MI 48109-2113 
Phone: 313/764-3482 

Fax: 313/936-1333 

E-mail: fblouin@umich.edu 


Philip L. Cantelon 

President 

History Associates Incorporated 
5 Choke Cherry Road 


Suite 280 
Rockville, MD 20850-4004 


Phone: 301/670-0076 
Fax: 301/670-2765 
E-mail: pcantelon@mcimail.com 


Vinton G. Cerf 

Senior Vice President of Internet 
Architecture and Engineering 

MCI Communications Corporation 
2100 Reston Parkway 

Reston, VA 20191 

Phone: 703/715-7432 

Fax: 703/715-7436 

E-mail: vcerf@mci.net 


Paul Ceruzzi 

Curator 

National Air and Space Museum 
Smithsonian Institution 

7th and Independence 
Washington, DC 20560 

Phone: 202/357-2828 

Fax: 202/786-2947 

E-mail: nasem001@sivm.si.edu 


James B. Gardner 

Project Director 

Conference on Documenting the 
Digital Age 

4000 Massachusetts Avenue NW 
#228 

Washington, DC 20016-5108 
Phone: 202/363-3420 

Fax: 202/785-3948 

E-mail: 104316.2753@compuserve.com 


Adam L. Gruen 

Corporate Historian 

MCI Corporate Archives 
1133 19th Street NW 
Washington, DC 20036 
Phone: 202/736-6290 

Fax: 202/736-6289 

E-mail: adamgru@well.com 


Trotter Hardy 

William and Mary School of Law 
South Henry Street 

Room 216 

Williamsburg, VA 23187 
Phone: 757/221-3826 

Fax: 757/221-3261 

E-mail: thardy@facstaff.wm.edu 


Margaret Hedstrom 

Associate Professor 

School of Information 
University of Michigan 

550 E. University Avenue 
Room 304 

Ann Arbor, MI 48109-1092 
Phone: 313/747-3582 

Fax: 313/764-2475 

E-mail: hedstrom@umich.edu 


Steven Hensen 

Librarian and Assistant Director, 
Special Collection Library 
Perkins Library 

Box 90185 

Duke University 

Durham, NC 27708 

Phone: 919/660-5820 

Fax: 919/684-2855 

E-mail: hensen@acpub.duke.edu 


Birgit Ireland 

Deputy Director for Information 
Resources Management Services 
History Associates Incorporated 
5 Choke Cherry Road 

Suite 280 

Rockville, MD 20850-4004 
Phone: 301/670-0076 

Fax: 301/670-2765 

E-mail: HAIBPI@mcimail.com 




Brewster Kahle 

President, Internet Archive 
Presidio 

Building 116 

P. O. Box 29141 

San Francisco, CA 94129 
Phone: 415/561-6793 

Fax: 415/561-6795 

E-mail: brewster@archive.org 


John C. Klensin 

Senior Data Architect 

MCI Communications Corporation 
800 Boylston St., 7th floor 
Boston, MA 02199 

Phone: 617/960-1011 

Fax: 617/960-1009 

E-mail: klensin@mci.net 


Steven Lubar 


Chairman, Division of the History of Technology 

National Museum of American History 
Smithsonian Institution 

Washington, DC 20560 

Phone: 202/357-2371 

Fax: 202/357-4256 

E-mail: lubar@nmah.si.edu 


Mark Luker 

Director, NSFNet Program 
National Science Foundation 
Room 1175 

4201 Wilson Boulevard 
Arlington, VA 22230 
Phone: 703/306-1949 

Fax: 703/306-0621 

E-mail: mluker@nsf.gov 


John McChesney 

National Public Radio 

San Francisco Bureau 

2601 Mariposa Street 

San Francisco, CA 94110 
Phone: 415/487-3200 

Fax: 415/552-4667 

E-mail: jmcchesney@aol.com 


John Markoff 

San Francisco Bureau 

New York Times 

1 Embarcadero Center 
Suite 1310 

San Francisco, CA 94111 
Phone: 415/362-3912 
E-mail: markoff@nyt.com 




Jim Miller 

Domain Leader, Technology and 
Society 

World Wide Web Consortium 
Massachusetts Institute of Technology 
Laboratory for Computer Science 
545 Technology Square 
Cambridge, MA 02139 

Phone: 617/253-2613 

Fax: 617/258-5999 

E-mail: jmiller@w3.org 


Michael Miller 

Director 

Records Management Programs 

National Archives and Records 

Administration 

8601 Adelphi Road 

College Park, Maryland 20740 

Phone: 301/713-7110, ext. 229 
Fax: 301/713-6852 

E-mail: mike.miller@arch2.nara.gov 


James M. Muldoon 
Professor 

Department of History 
Rutgers University-Camden 
311 North Fifth Street 
Camden, NJ 08102 
Phone: 609/225-6069/6080 
Fax: 609/225-6602 




Nathan Myhrvold 

Chief Technology Officer 
Microsoft Corporation 

One Microsoft Way 

Redmond, WA 98052-6399 
Phone: 206/936-7333 

Fax: 206/936-1222 

E-mail: nathanm@microsoft.com 


Ronald Plesser 

Partner 

Piper and Marbury 

1200 19th Street NW 
Washington, DC 20036 
Phone: 202/861-3969 
Fax: 202/223-2085 


E-mail: rplesser@pipermar.com 


Bert C. Roberts, Jr. 

Chairman 

MCI Communications Corporation 
1801 Pennsylvania Avenue NW 
Washington, DC 20006 

Phone: 202/887-2166 

Fax: 202/887-2178 

E-mail: bert@mci.com 


Marc Rotenberg 

Director 

Electronic Privacy Information Center 
666 Pennsylvania Avenue SE 

Suite 301 

Washington, DC 20003 

Phone: 202/544-9240 

Fax: 202/547-5482 

E-mail: rotenberg@epic.org 


C. M. Sperberg-McQueen 
Senior Research Programmer 
Computer Center (M/C 135) 
University of Illinois at Chicago 
1940 W. Taylor St. 

Room 124 

Chicago, IL 60612-7352 
Phone: 312/413-0317 

Fax: 312/996-6834 

E-mail: cmsmcq@uic.edu 
tei@uic.edu 


Elizabeth Stephenson 

Data Archivist 

Institute for Social Science Research 
University of California, Los Angeles 
Box 951484 

Los Angeles, CA 90095-1484 
Phone: 310/825-0716 

Fax: 310/206-4453 

E-mail: libbie@ucla.edu 


Kenneth F. Thibodeau 
Director 

Center for Electronic Records 
National Archives and Records 
Administration 

8601 Adelphi Road 

College Park, MD 20740-6001 
Phone: 301/713-6630 

Fax: 301/713-6911 

E-mail: ken.thibodeau@arch2.nara.gov 


Anne Van Camp 

Member Services Officer 
Research Libraries Group 

1200 Villa Street 

Mountain View, CA 94041 
Phone: 415/691-2237 

Fax: 415/964-0943 

E-mail: BL.AHV@RLG.ORG 


Donald Waters 

Associate University Librarian 
Yale University 

116B 

Sterling Memorial Library 
P.O. Box 208240 

New Haven, CT 06520 
Phone: 203/432-4889 

Fax: 203/432-7231 


E-mail: donald.waters@yale.edu 




APPENDIX B 
PARTICIPANT BIOGRAPHIES 


BILL BARNES is Program Manager and part-time technology columnist for Slate 
magazine, Microsoft Corporation's online journal of politics, policy and culture. He 
received his B.S. in mathematics from Carnegie Mellon University in 1989, and was 
immediately recruited by Microsoft, where he has spent the last eight years working on 
consumer software. Mr. Barnes shipped five versions of Microsoft Works, helped 
develop the original Microsoft Network, and coordinated the Titanium project, an effort 
to tag Internet content for easier search and navigation. 


GWEN BELL is the founding president of The Computer Museum, Boston, and Director 
of Historical Collections for The Computer Museum History Center, Silicon Valley. The 
Computer Museum was established in 1979 within Digital Equipment Corporation and 
became a public non-profit educational institution in 1982, moving to downtown Boston 
in 1984. In 1996, The Computer Museum History Center was established in Silicon 
Valley. The collection includes 2,000 artifacts ranging from the Whirlwind Computer to 
microprocessors; and documentation, especially manuals, photographs, films, video, and 
ephemera. Prior to founding the Museum, Dr. Bell taught and did international consulting 
in city planning. She has a B.S. from the University of Wisconsin, MCRP from Harvard 
University's Graduate School of Design, and a Ph.D. in Geography from Clark University. 


FRANCIS X. BLOUIN received his A.B. from Notre Dame University and his M.A. and 
Ph.D. in history from the University of Minnesota. He joined the staff of the Bentley 
Historical Library at the University of Michigan in 1974. The library is the principal 
repository for personal papers and private organizational records in the State of Michigan 
and also houses the archives of the University. Blouin joined the faculty of the 
Department of History in 1975 and joined the faculty of the School of Information (then 
Library Science) in 1978. He was named Director of the Bentley Library in 1981. Blouin 
has directed many projects, including one documenting migration and ethnicity, and 
another documenting the temperance and prohibition movements. He is currently working 
on a project to enhance access to the Archivio Segreto Vaticano. 


PHILIP L. CANTELON is a graduate of Dartmouth College and holds advanced 
degrees in history from the University of Michigan and Indiana University. He taught 
contemporary American history at Williams College from 1968-1977 and was a Fulbright 
Professor of American Civilization in Japan. Dr. Cantelon is an oral history consultant, 
and president and chief executive officer of History Associates Incorporated, a 
professional historical, archival, and records management services company based in 
Rockville, Maryland. His publications include Crisis Contained, a history of the Three 
Mile Island nuclear accident; The American Atom, a college text on American nuclear 
policy; The History of MCI, 1968-1988: The Early Years, a study of the role of 
entrepreneurship and corporate culture in the growth and success of MCI; The Roadway 
Story, a corporate history focusing on the effect of government regulation and 
deregulation on the trucking industry; and Corporate Archives and History: Making the 
Past Work, a primer on the value and use of historical materials by businesses. He 
currently serves on the Organization of American Historians' Committee on Research and 
Access to Historical Documentation. 


VINTON G. CERF is Senior Vice President of Internet Architecture at MCI 
Communications Corp. and is responsible for the development of MCI's Internet Network, 
the world's fastest and largest Internet backbone. The co-developer of the computer 
networking protocol, TCP/IP, the language for Internet communications, Dr. Cerf played 
a major role in sponsoring the development of Internet-related data packet technologies 
during his stint with the Department of Defense's Advanced Research Projects Agency 
(ARPA) from 1976 to 1982. A graduate of Stanford University with a Ph.D. from 
U.C.L.A., Cerf is a fellow of the Institute of Electrical and Electronics Engineers (IEEE), 
the Association for Computing Machinery (ACM), the American Association for the Advancement of 
Science, and the American Academy of Arts and Sciences. He is the recipient of 
numerous awards and commendations in connection with his work on the Internet. 


PAUL E. CERUZZI is Curator of Aerospace Electronics and Computing at the 
Smithsonian's National Air and Space Museum in Washington, D.C. He received a B.A. 
from Yale University in 1970 and a Ph.D. in American Studies from the University of 
Kansas in 1981. His graduate studies included a year as a Fulbright Scholar at the 
Institute for History of Science in Hamburg, West Germany, and he received a Charles 
Babbage Institute Research Fellowship in 1979. Before joining the staff of the National 
Air and Space Museum, he taught History of Technology at Clemson University in 
Clemson, South Carolina. Ceruzzi is the author or co-author of several books on the 
history of computing and related issues: Reckoners, The Prehistory of The Digital 
Computer (1983), Computing Before Computers (1990), Smithsonian Landmarks in the 
History of Digital Computing (1994), and Beyond the Limits: Flight Enters the Computer 
Age (1989). He has served as a consultant on the BBC television series "The Machine 
that Changed the World," and is presently working on a comprehensive book on the 
origins and history of computing in the United States. 




JAMES B. GARDNER holds a B.A. in history from Rhodes College and an M.A. and a 
Ph.D. in history from Vanderbilt University. He currently works as a consultant with 
History Associates Incorporated and LaPaglia & Associates. Previously he served as 
Deputy Executive Director of the American Historical Association (1986-1996) and on 
the staff of the American Association for State and Local History (1978-86). 
Dr. Gardner's publications include Ordinary People and Everyday Life: Perspectives on 
the New Social History, A Historical Guide to the United States, and contributions to 
several anthologies and periodicals. His most recent project is Public History: Essays 
from the Field, a collection of twenty-six essays he is editing for publication in late 1997. 
Dr. Gardner was a guest lecturer in the Public History Graduate Program at Arizona State 
University in the spring of 1996 and frequently appears on the programs of such 
organizations as the American Association of Museums, the American Council of Learned 
Societies, and the Association of American Colleges and Universities as well as at 
meetings, conferences, and seminars sponsored by local, state, and regional organizations. 
He currently serves on the Board of Editors of The Public Historian and the Academic 
Advisory Committee of the James Madison Memorial Fellowship Foundation and chairs 
the G. Wesley Johnson Award Committee for the National Council on Public History. 


ADAM L. GRUEN, Corporate Historian, MCI Library, was formerly Project Director 
for NASA's Space Station History Project from 1985-1993. Dr. Gruen is a historian of 
technology with special expertise in the history of aerospace, networking systems, and 
modern structural and civil engineering. A graduate of Duke University (1989), Dr. 
Gruen studied with Dr. Alex Roland, former President of the Society for History of 
Technology. Dr. Gruen came to the D.C. area in 1984 on a Guggenheim Fellowship at 
the National Air and Space Museum. 


I. TROTTER HARDY was graduated Order of the Coif from Duke University, where he 
served as Article Editor for the Duke Law Journal. After graduation, he clerked for the 
Honorable John D. Butzner, Jr., on the Federal Court of Appeals for the Fourth Judicial 
Circuit. Following that clerkship, he joined the law faculty at the College of William & 
Mary, where he teaches Intellectual Property, Torts, and Legal Issues of the Internet. 
Professor Hardy is the author of articles on the design of computer command languages, 
international data flows, health law, law and computers, copyright law, and the legal issues 
of the Internet. His current research interests include copyright and new technologies; 
governmental regulation of the Internet; and electronic journals. He is the moderator of 
an Internet discussion list called "CYBERIA" dealing with the law and policy of computer 
networks. He also founded and is the current editor of "The Journal of Online Law," an 
electronic journal dealing with computer communications legal issues. Professor Hardy 
served as Scholar in Residence and Technical Advisor to the Register of Copyrights from 
June to December 1996. In that capacity, he directed Project Looking Forward, a project 
to predict future issues of digital technology and copyright. 


MARGARET HEDSTROM is an Associate Professor at the School of Information, 
University of Michigan, where she teaches in the areas of archives, electronic records 
management, and digital preservation. Before joining the faculty at Michigan in 1995, she 
worked for ten years at the New York State Archives and Records Administration where 
she was Chief of State Records Advisory Services and Director of the Center for 
Electronic Records. Dr. Hedstrom earned Master's Degrees in Library Science and 
History and a Ph.D. in History from the University of Wisconsin-Madison where she 
wrote a dissertation on the history of office automation in the 1950s and 1960s. She has 
published widely on many aspects of archival management, electronic records, and 
preservation in digital environments and served as a consultant to many government 
archival programs. Her current research interests include digital preservation strategies, 
the impact of electronic communications on organizational memory and documentation, 
and remote access to archival materials. She is a fellow of the Society of American 
Archivists and was the first recipient of the annual Award for Excellence in New York 
State Government Information Services. 


STEVEN L. HENSEN is Director of Planning and Project Development in the Special 
Collections Library at Duke University. He is the author of Archives, Personal Papers, 
and Manuscripts, the AACR2-based standard for archival cataloging, and of numerous 
articles and papers in the area of archival description and standards, as well as over 40 
workshops and consultancies in archival cataloging and description and the use of the 
USMARC format. He is a graduate of the University of Wisconsin-Madison, where he 
also received his MLS in 1971. He has worked in special collections/archives at the State 
Historical Society of Wisconsin, the University of Chicago, Yale University, and the 
Library of Congress and has served as Program Officer for Archives, Manuscripts, and 
Special Collections at the Research Libraries Group. He is a Fellow of the Society of 
American Archivists and is currently serving on its governing Council. He was a member 
of the SAA National Information Systems Task Force, the Working Group on Standards 
for Archival Description, the 1989 Airlie House Multiple Versions Forum, the 
development team for EAD, an SGML-compliant standard for encoding archival finding 
aids, and numerous other groups. His current activities involve developments relating to 
enhanced network access to archival information and collections. 


BIRGIT IRELAND is deputy director of Information Resources Management Services 
at History Associates Incorporated. She manages projects and provides archives and 
records management services to private and public organizations. Since joining History 
Associates in 1991, her assignments have included the establishment of the MCI 
Corporate Archives and developing a comprehensive records system for the National 
Institute of Occupational Safety and Health. Recently, she advised the Panama Canal 
Commission on records management requirements for the Commission's primary 
electronic systems, pending the turnover of Canal operations to the Republic of Panama, 
and identified historically valuable data files for transfer to the National Archives. 

Ms. Ireland holds an M.A. in history and a Master of Library Science degree with a 
specialization in archives management from the University of Maryland in College Park. 
She is a certified archivist and member of the Academy of Certified Archivists. 


BREWSTER KAHLE is a founder of the Internet Archive, which was established in April 
1996. Before that, he invented the Wide Area Information Servers (WAIS) system in 1989 
and founded WAIS Inc. in 1992. WAIS helped bring commercial and government agencies 
onto the Internet by selling Internet publishing tools and production services to companies 
such as Encyclopedia Britannica, the New York Times, and the Government Printing Office. 
Schooled at MIT, Brewster designed supercomputers in the 1980s at Thinking Machines 
Corporation. 


JOHN C. KLENSIN is Senior Data Architect in MCI's Internet Architecture Department 
where he has lead responsibility for the design of many of MCI's Internet applications 
services and products. Prior to coming to MCI in mid-1994, he was INFOODS Project 
Coordinator for the United Nations University and, before that, was at MIT for nearly 30 
years, holding Principal Research Scientist appointments in several departments including 
Architecture, the Center for International Studies, and the Laboratory of Architecture and 
Planning. Earlier, Klensin was technical director of the Cambridge Project, a joint 
Harvard-MIT activity whose goal was to advance the state of social and behavioral 
science computing, and principal architect of its main product, the "Consistent System." 
In that role, he made contributions to the applications interface design of the Multics 
system, participated actively in the group that defined what became the Internet's File 
Transfer Protocol, and was the co-designer of some of the ARPANet's earliest trials in 
transparent multiple-host applications for non-computer-specialists. 


STEVEN LUBAR is Chairman of the Division of the History of Technology at the 
Smithsonian Institution's National Museum of American History. A 1976 graduate of 
MIT in humanities and engineering, he received his Ph.D. in the history of science and 
technology from the University of Chicago in 1983. At the Smithsonian Institution since 
1982, he has worked on many exhibitions, including Engines of Change, a history of the 
American Industrial Revolution, and Information Age, a history of communications and 
information processing. He is the author of six books and over forty articles on the 
history of technology, material culture, and museum studies. Dr. Lubar has been a visiting 
professor at the University of Maryland and the University of Pennsylvania. 


MARK LUKER earned his Ph.D. in 1975 from the University of California at Berkeley. 
He served on the faculty and was Acting Dean of Science and Engineering at the 
University of Minnesota, Duluth. He performed research in the theory of computer 
languages, microcomputer systems for education, and network support for distributed 
microcomputers. He has been Chief Information Officer of the University of 
Wisconsin-Madison since 1992, leading the campus in strategic planning for networks and 
information technology and a 600-person division in the delivery of technology services. 
He has also been a member of the National Learning Infrastructure Initiative, on the Board 
of Trustees of EDUCOM, Chair of the Board of CICNET, and is now Program Director, 
NSFNET Program, in charge of NSF's vBNS advanced network program and 
high-performance connections for research and education, the routing arbiter and NAPs, 
and related networking development projects. Dr. Luker is active in the definition of the 
federal Next Generation Internet project and the university-based Internet 2. 


JOHN MCCHESNEY is National Public Radio's reporter on the evolution of the digital 
age and the host of HotSeat, a column on the Internet "zine" HotWired. A graduate of 
Southern Methodist University, McChesney did graduate work in American literature at 
Stanford University and taught six years at Antioch College in Yellow Springs, Ohio. He 
started with Morning Edition on NPR in 1979 in charge of domestic news outside of 
Washington, and later became the foreign editor. McChesney reported on the "great 
competitive struggle" between the United States and Japan for semiconductor supremacy 
and now reports on Silicon Valley and the development of the digital age. A 
self-described "techno-illiterate," McChesney sees his job as an electronic journalist who can 
clearly explain the Web's culture and technology to the uninitiated and cut through the 
technobabble to the real story. 


JIM MILLER joined the World Wide Web Consortium in June 1995 having designed 
and implemented numerous innovative and useful real-world systems over more than 
twenty years. His work involves people interacting with computers to perform tasks 
better than either can do alone. He creates systems which allow each partner in the task to 
understand the other's abilities and limitations, enabling each to make informed decisions 
about the division of labor. His work deals with creating simple models of what the 
computer does and conveying them to the human partners. Jim is one of the principal 
designers of the PICS (Platform for Internet Content Selection) specifications in addition 
to overseeing development in Payments, Demographics and Privacy, Intellectual Property 
Rights, and Security. 




MICHAEL L. MILLER serves as the Director of Records Management Programs at 
the National Archives and Records Administration, with responsibility for directing and 
managing NARA's records management programs for federal agencies. He was previously 
National Program Manager for the Records Management and Agency Records Office with 
the Environmental Protection Agency. Dr. Miller teaches records management courses at 
the University of Maryland and Catholic University. He received his B.A. in history from 
Wheeling College, and his M.A. and Ph.D. in history from Ohio State University. A 
certified records manager and certified archivist, Dr. Miller has extensive experience in 
policy and program development and innovative records management techniques. He 
currently serves as co-chair of the Interagency Electronic Records Management Work 
Group for the federal government. 


JAMES MULDOON is a graduate of Iona College (1957). He obtained an M.A. at 
Boston College (1959) and a Ph.D. at Cornell University (1965). Since 1970, he has 
taught history at Rutgers University, Camden. Professor Muldoon has been a Fulbright 
Fellow at Trinity College of Cambridge University, a fellow of the Institute for Advanced 
Study, and a National Endowment for the Humanities Research Fellow at the John Carter 
Brown Library of Brown University. A specialist in medieval, legal, and ecclesiastical 
history, Professor Muldoon has written The Expansion of Europe: The First Phase 
(1977); Popes, Lawyers, and Infidels: The Church and the Non-Christian World 1250-
1550 (1979); and The Americas in the Spanish World Order: The Justification for 
Conquest in the Seventeenth Century (1994). Professor Muldoon is currently working on 
a book dealing with the concepts of empire and world order from the thirteenth to the 
seventeenth centuries. 


NATHAN MYHRVOLD is chief technology officer of the Microsoft Corporation and a 
member of the company's Executive Committee. Dr. Myhrvold leads the Advanced 
Technology and Research Group, which determines and funds research and development 
priorities for the company. He is a graduate of the University of California at Berkeley, 
and holds advanced degrees in geophysics, space physics, and mathematical economics 
and a Ph.D. in theoretical and mathematical physics from Princeton University. He did 
postdoctoral work in applied mathematics and theoretical physics at Cambridge University 
where he worked with Professor Stephen Hawking. He joined Microsoft as director of 
special projects in 1986 when Microsoft acquired Dynamical Systems, which Myhrvold 
founded and headed as president and CEO. He serves on the board of trustees of the 
Institute for Advanced Study in Princeton, New Jersey, and is a member of the National 
Information Infrastructure Council. 




RONALD L. PLESSER is a graduate of the George Washington University with a 
degree in English Literature and Law. In 1972, Mr. Plesser joined the Center for Study of 
Responsive Law and was primarily responsible for litigation and legislative activities 
concerning the Freedom of Information Act. In 1975, Mr. Plesser served as General 
Counsel to the U.S. Privacy Protection Study Commission. Currently he is a partner with 
the law firm of Piper and Marbury, Washington, D.C. Mr. Plesser is past-Chair of the 
Individual Rights and Responsibilities Section of the American Bar Association. He has 
been an adjunct professor of law at George Washington University (1982-1986). He also 
was Deputy Director of the Science, Space and Technology Cluster of the 1992 Clinton- 
Gore transition team. Mr. Plesser specializes in issues that concern legislative matters, 
telecommunications, privacy, data base companies, publishers, information and software 
providers and users, marketers, and other companies affected by the emergence of new 
information technologies. 


BERT C. ROBERTS, JR. is chairman of MCI Communications Corporation. A 24-year 
veteran of the company, Roberts played a key role in transforming MCI from a start-up 
long distance company to the leader in global communications that it is today. Over the 
years, he has served in a number of senior management positions that included roles in 
nearly all aspects of the company’s operations. He was named chairman of the company in 
June 1992, after serving as president and chief operating officer since 1985. A native of 
Kansas City, Missouri, Roberts graduated from Johns Hopkins University in Baltimore, 
where he earned a bachelor of science degree in electrical engineering. 


MARC ROTENBERG is director of the Electronic Privacy Information Center in 
Washington, D.C., and teaches information privacy law at Georgetown University Law 
Center. He has served on numerous national and international advisory panels, including 
most recently the Expert Panel on Cryptography Policy for OECD. He has testified before 
Congress on many issues, including access to information, encryption policy, computer 
security, and communications privacy. A graduate of Harvard College and Stanford Law 
School, Mr. Rotenberg has been named by Newsweek magazine "one of 50 people who 
matter most on the Internet" and by American Lawyer "one of the top 45 public sector 
lawyers under 45." The Electronic Privacy Information Center (EPIC) is a public interest 
research center in Washington, D.C., established by the Fund for Constitutional 
Government to protect privacy, civil liberties and constitutional values. EPIC is currently 
litigating ACLU v. Reno, the constitutional challenge to the Communications Decency Act, 
and a series of Freedom of Information Act cases that seek to make documents from 
government agencies available for public review. 


C. M. SPERBERG-MCQUEEN is (1) the co-editor of the Text Encoding Initiative, a 
cooperative international project to develop methods of encoding electronic texts for use 
in literary, linguistic, historical, or other textual research; (2) a co-coordinator of the 
Model Editions Partnership, a consortium of U.S. historical documentary editions 
developing experimental models of electronic historical editions; and (3) a senior research 
programmer in the network services group in the computer center of the University of 
Illinois at Chicago. In all three roles, he is primarily concerned with the electronic 
representation of textual and other complex material. The TEI Guidelines define a set of 
document type definitions using SGML (the Standard Generalized Markup Language) 
suitable for the encoding of pre-existing texts, whether prose, verse, drama, letters, 
memoranda, dictionaries, terminology databases, or mixtures of the above. The Model 
Editions Partnership is using the TEI's markup scheme as the basis for its work with 
modern documentary editions. The UIC computer center is using the TEI scheme as the 
basis for its SGML tagging of its local technical documentation. 


ELIZABETH STEPHENSON has been working with social science data files, such as 
surveys, Census enumerations, and administrative records on one computing platform or 
another for about 20 years. Although her professional interests in preservation, access, 
and retrieval have been concentrated in the social sciences, her early career involved 
preservation of other forms of "records," including photographs and slides of modern 
dance, and theater and dance costumes. She is currently president of the International
Association for Social Science Information Services and Technology (IASSIST), and
serves on the governing Council for the Inter-university Consortium for Political and
Social Research (ICPSR). Ms. Stephenson is helping to design an archival development
policy for ICPSR, focusing on the future data and documentation needs of social science
research.


KENNETH THIBODEAU is currently Director of the Center for Electronic Records at 
the National Archives and Records Administration, which is the archival custodian for the 
permanently valuable electronic records of the federal government. In 1995, he was 
detailed from NARA to the Department of Defense, where he served as Director of the 
Records Management Task Force, which was set up to implement business process 
reengineering for records management. Prior to his appointment as Center Director in 
December 1988, Dr. Thibodeau was the first Chief of the Records Management Branch of 
the National Institutes of Health (NIH) with responsibilities for records management, 
office automation, privacy, and strategic planning for information resources management. 
Dr. Thibodeau holds a bachelor's degree in history from Fordham University and earned a
Ph.D. in the history and sociology of science from the University of Pennsylvania. He has
held fellowships in history and computer science at the University of Strasbourg, France,
the University of Kansas, and the Newberry Library in Chicago.


ANNE VAN CAMP was appointed Member Services Officer of the Research Libraries 
Group (RLG) in September 1996. Her portfolio of responsibility includes working with 
the international primary sources community in finding solutions to problems surrounding 
preservation and access to primary research materials. Prior to joining RLG, Ms. Van 
Camp served for eight years as Archivist of the Hoover Institution at Stanford University 
where she managed a world-renowned archives on twentieth century social and political 
change. From 1980-1989, she was Archivist and Manager of Information Services of the 
Chase Manhattan Corporation. Active in professional associations nationally and 
internationally, she served on the governing council of the Society of American Archivists 
from 1990-1993, and is currently serving as an appointed member of the United States 
Department of State's Historical Advisory Committee. 


DONALD J. WATERS is the Associate University Librarian in the Yale University 
Library. Dr. Waters has a B.A. in American Studies from the University of Maryland, 
College Park, and a doctorate in anthropology from Yale University. After completing his 
degree in 1982, Waters served as a user consultant in the Yale Computer Center and as 
the Director of Computer Services at the Yale School of Management. In 1987 he joined 
the library as the Head of the Systems Office. He became Director of Library and
Administrative Systems in 1992 and was appointed Associate University Librarian in 
1993. In addition to his responsibilities at Yale from 1994-1996, Waters served as a co- 
chair of the Task Force on Archiving of Digital Information, which the Research Libraries 
Group and the Commission on Preservation and Access jointly sponsored. Waters was 
the principal author of the Task Force report. Among his other publications, Waters is the 
editor of a collection of African-American folklore originally published during the 1890s in 
the Southern Workman, the newspaper of the Hampton Institute. 




APPENDIX C 
CONFERENCE PROGRAM 
FEBRUARY 10 (MONDAY) 
7:00 p.m. Reception 
7:30 p.m. Opening Dinner and Welcome 


Philip L. Cantelon, History Associates Incorporated 
Bert C. Roberts, Jr., MCI Communications Corporation 
Vinton G. Cerf, MCI Communications Corporation 
Nathan Myhrvold, Microsoft Corporation 

Keynote: “Communications Revolutions” 


James M. Muldoon, Department of History, Rutgers 
University-Camden 


FEBRUARY 11 (TUESDAY) 
8:00-8:45 a.m. Breakfast 


9:00-9:15 a.m. Introductions and Orientation
James B. Gardner, History Associates Incorporated 


9:15-10:00 a.m. "Documenting the History of the Digital Age: What Do We
Know?"
Chair: Steven Hensen, Perkins Library, Duke University 
Paper: Vinton G. Cerf, MCI Communications Corporation 
John C. Klensin, MCI Communications Corporation 
Response: Gwen Bell, The Computer Museum History Center 


10:00-10:15 a.m. Break


10:15-11:45 a.m. "Capturing History Digitally: Why Archive the Internet?"
Chair: John McChesney, National Public Radio 
Paper: Nathan Myhrvold, Microsoft Corporation 
Response: Steven Lubar, National Museum of American 
History, Smithsonian Institution 


11:45 a.m.-1:00 p.m. Lunch




1:00-3:00 p.m. "Assessing the Need: What Information and Activities
Should We Preserve?"
Chair: John Markoff, New York Times 
Paper: Michael Miller, Records Management, National 
Archives and Records Administration 
Responses: 
Paul Ceruzzi, National Air and Space Museum, 
Smithsonian Institution 
Francis X. Blouin, Bentley Historical Library, University 
of Michigan 
Elizabeth Stephenson, Institute for Social Science 
Research, UCLA, and the International Association
for Social Science Information Services and
Technology


3:00-3:15 p.m. Break 


3:15-5:00 p.m. "Developing Strategies: How Do We Archive Digital Records?"
Chair: Birgit Ireland, History Associates Incorporated 
Papers: 
Brewster Kahle, Internet Archive 
Donald Waters, Yale University Library and the Task
Force on Archiving Digital Information 
Response: Adam L. Gruen, MCI Corporate Archives 


6:30 p.m. Leave hotel (transportation provided) for demonstration of the 
Internet Archive and dinner at the Presidio 


FEBRUARY 12 (WEDNESDAY) 


7:30-8:15 a.m. Breakfast 


8:30-10:00 a.m. "What Are the Legal Complications in Archiving Digital Records?"
Chair: Ronald Plesser, Piper and Marbury 
Papers: 
Marc Rotenberg, Electronic Privacy Information Center 
Trotter Hardy, William and Mary School of Law 
Response: Anne Van Camp, Research Libraries Group 




10:00-10:15 a.m. Break 


10:15-11:30 a.m. "How Do We Make Electronic Archives Usable and Accessible?"
Chair: Jim Miller, Massachusetts Institute of Technology and
the World Wide Web Consortium
Paper: Margaret Hedstrom, School of Information, University 
of Michigan 
Responses:
C. M. Sperberg-McQueen, University of Illinois, 
Chicago, and the Text Encoding Initiative 
Bill Barnes, Slate Magazine, Microsoft Corporation 


11:30-12:30 p.m. Working Lunch: "Taking Action: What Do We Do Next? Who
Will Take Responsibility? Who Will Provide Funding?"
Chair/discussion leader: Philip L. Cantelon, History Associates 
Incorporated 
Comments: 
Vinton G. Cerf, MCI Communications Corporation 
Mark Luker, NSFNet Program, National Science 
Foundation 
Kenneth F. Thibodeau, Center for Electronic Records, 
National Archives and Records Administration 


12:30 p.m. Adjournment 


APPENDIX D 
POSITION PAPERS 




COMMUNICATIONS REVOLUTIONS: 
LITTLE HISTORY, MUCH MYTH 
A Paper Presented at 
DOCUMENTING THE DIGITAL AGE 


San Francisco, California
February 10 - 12, 1997 


James Muldoon 
Professor of History 
Rutgers University 
Camden College of Arts and Sciences 


Anyone who has looked even for a moment at the biographical statements we
were asked to fill out for this conference might legitimately ask "Why is a medieval
legal and ecclesiastical historian the keynote speaker for a conference on the digital
age?" That is a fair question and deserves an answer, so allow me to spend a few
minutes explaining why I am addressing you and not a specialist in the history of
science or technology or, at the very least, a modern historian. The reason is quite
simple: I am not any of those kinds of historians. Instead, I bring to the fundamental
point of this talk, comparing and contrasting the printing revolution of the fifteenth
century and the computer revolution of the twentieth, the perspective of someone
whose professional career has been shaped by earlier communications revolutions
that have characterized European history. That perspective provides, I trust, some
particular insights for those anxious to record the early history of the electronic
communications revolution.

From my perspective the communications revolution has been thus far a
four-stage movement developing over almost 1500 years.


In the first place, there was the communications revolution of the sixth to the
tenth centuries. In this stage, the classical and Christian literary heritage of the
Roman world was copied onto parchment by monks anxious to preserve the ancient
learning that was threatened with extinction because it was written on papyrus, a
product admirably suited for the written word in warm, dry Egypt but which decayed
rapidly in the cold and wet lands north of the Mediterranean. This not only served to
retain this ancient knowledge, but to spread it as well, first as monasteries sought to
expand their libraries by obtaining copies of manuscripts found elsewhere.
Subsequently, in the eleventh and twelfth centuries, the establishment of cathedral
schools and then universities meant that this preserved knowledge would be available
throughout the literate European world. The copyroom where Brother Francis labored
in A Canticle for Leibowitz provides a realistic picture of an early medieval scriptorium
and the market it served.

The second stage of this revolution occurred with the appearance of the 
printing press in the mid-fifteenth century. As we shall see, this stage had much in 
common with what preceded it. 

The next great leap in communications technology came in the nineteenth 
century with the invention of devices to send messages over wires. The telegraph, 
now virtually defunct, and then the telephone enabled people to communicate with 
one another over vast spaces. Radio and television were twentieth-century 
continuations of the electrical communications stage of development. 

The final stage in this development, although most likely only the most recent 
stage, is the computer-based digital communications revolution of our own day. 

My own research is closely connected to and greatly affected by the 
transformation from handwritten manuscripts to the printed page. In the first place, 
the early printers produced works for which there already existed a market. In the 
case of the field in which I work, a large number of manuscripts from the earliest
stage of the development of the canon law, the law of the medieval church, remained
unpublished because they had been superseded by later developments.
Fifteenth-century lawyers wanted the more up-to-date commentaries on the law, not
the earliest ones. There was no market for the foundational records. Many of these 
early writings have never been published and remain available only in manuscript 
form, yet in order to understand the early development of the legal system of which 
they are a fundamental part, these documents are of vital importance. In the second 
place, even the published works, the later legal writings, appear only in editions that 
are often, by twentieth-century standards, poor because they were based only on 
whatever manuscripts happened to be available to the printer at that moment. There 
was little attempt at creating a scholarly accurate text, one free of copyists’ errors and 
so on. Admittedly this would have been difficult as there was no index of 
manuscripts, no central archive that contained a list of where they might be located. 
Furthermore, adding to the difficulties that scholars would later face, once a text had
been set in type, the parchment page could be thrown away because, after all, we
have the printed version, don’t we?

What I have just described here, from the perspective of someone in the late
twentieth century who has to deal with it, is one of the basic problems in the history
of technology, the loss of information that occurs when a new technology is
introduced and supersedes an older one. At this point we have come to the main
theme of my talk, some comparisons and contrasts between the printing revolution
and the digital revolution and some suggestions for those involved in recording the
history of the latter.


In keeping with a basic premise of all teaching, let us begin with a statement of 
what we might call the basic understanding of the printing revolution, a starting point 
from which we can all proceed, the "common knowledge" of this revolution that we all
share. Around 1450, Johann Gutenberg invented the printing press. The 
consequences of this invention were extraordinary as the history of the next 150 years 
demonstrated. In the first place, literacy increased throughout Europe. The greater 
availability of books meant the beginning of the end of the Catholic Church’s control 
of the intellectual life of Europe, a movement that culminated in the Protestant 
Reformation and the scientific revolution. Information now flowed across national 
boundaries undercutting the powers of absolute monarchs as well, paving the way for 
the modern democratic, constitutional state, and, in general, creating what we term 
the modern world. Furthermore, this international flow of written materials contributed 
to a sense of belonging to a Europe-wide society instead of a limited national 
kingdom.

The interesting point of these statements of "common knowledge" is that none
of them is completely accurate and some are quite wrong. We could even call it a
collection of ‘common misunderstandings’ that we all share. In the first place, it is
clear that Gutenberg was only one of several individuals who were experimenting with
printing. There was a widespread interest in producing books by technological
means rather than by hand. In the second place, the crucial invention was not the
press but moveable type. The printing press, using wood blocks to print playing
cards, devotional pictures, and the like had been available since the early fifteenth
century. Indeed, without a series of earlier developments, the press, paper, even the
alphabet which required fewer than 100 items, including upper and lower case letters,
punctuation marks, and numbers, Gutenberg’s invention would have been of little
significance. The subsequent development of printing in turn depended upon other
improvements such as those in metallurgy that led to the discovery of the proper mix of
lead, tin, and antimony to be used in the casting of type. What was important was
the bringing together of several developments, the synthesizing so to speak of several
technologies in order to achieve the final product, and then subsequent refinement of
the result. The pressure to meet the needs of the market then encouraged
improvements that would lengthen the working life of the type.

Granted that the technology of printing was the result of an evolutionary 
process, surely the sociological consequences were enormous. For example, literacy 
must have become of more use with the appearance of books. In fact, it was largely 
the other way round, the development of printing owing more to the increasingly
literate populations of urban Europe, the Rhineland, the Netherlands, and Italy than is
generally recognized. The market demand for the written word pushed the 
development of printing rather than the reverse. 

What about the Reformation? Was it not sparked by the publication of Martin 
Luther’s 95 Theses throughout Germany within weeks after he posted the handwritten 
original on the door of the university church in Wittenberg? To the limited extent that
this statement is true, the setting of Luther’s words in type was of marginal
significance. Handwritten copies could have had the same effect. The same is true
of the other consequences attributed to the invention of printing. It was not after all
only the Protestants who employed the printing press to advocate their views. The 
Catholic Church did the same thing. Furthermore, it is important to realize that the 
Protestant leadership was no happier with the members of the lower classes airing 
their theological and political opinions in public than the Catholic. 

Did governments fear printing? The answer is perhaps, but, like the religious 
leadership, secular rulers sought to use it for their own ends. While it is true that
critics of governments could now reach a larger audience, so too could governments. 
One consequence of printing was the standardization of language. Printers could not 
afford to print materials in each of the dialects found within every European kingdom. 
One consequence was the decline and even disappearance of those dialects that 
could not justify the costs of printing. Another was the way in which rulers insisted on 
publishing official documents in what we would now term the official national 
language thus requiring all their subjects to use that dialect. The appearance of a 
grammar of the Castilian language in the 1490s and the creation of the French 
Academy by Louis XIII to standardize the French language in 1635 are fundamental 
dates in the process of creating national languages. Latin, the international language 
of learning in the Middle Ages, gradually declined as the vernacular languages of 
Europe in their standardized versions rose. The use of the printing press, to the
extent that it contributed to the development and spread of national languages,
reduced a sense of internationalism and even strengthened the hands of rulers who
understood how to use language to their own advantage.


What we have considered in this discussion is what we might call the 
mythology of the printing revolution, that is the history of printing as we would like it 
to have been. In the first place, we like the heroic image of the lone inventor who 
creates something new and different that then has extraordinary consequences. In 
the second place, we are inclined to see the consequences of an invention as making
a quick, sharp break with the past. The slow, sometimes zigzagging process of
change is not aesthetically appealing.


Finally, and most important, we are inclined to see all the significant events that
followed the invention of printing as a consequence of that invention. This is in fact
one of the oldest of logical fallacies, that of post hoc ergo propter hoc, that is, if event
B follows event A then B was caused by A. A classic statement of this fallacy that
those of us of a certain age might remember occurred in the opening moments of the
old television program, Rhoda. In a voiceover, Rhoda is heard to say that she was
born in the summer of 1941: "The Japanese attacked Pearl Harbor in December. My
mother always thought there was a connection." As far back as the late sixteenth
century Francis Bacon was identifying printing as one of the crucial discoveries that 
created the modern world, but that does not make it necessarily so. In reality, Bacon 
and others who made this claim were doing so on ideological grounds rather than on 
historical ones. 

The history of the development of printing is thus filled with gaps, voids, and 
unverifiable conclusions and has been so for a very long time. As early as 1520 there
were those who were seeking to determine the facts of the development of printing;
they were unsuccessful. The mists of time had already enfolded the story.


What does this tale of myth and history have to tell those interested in the 
development of the digital revolution? While carefully avoiding that other popular 
mythology that begins "history teaches ...," let me make a couple of suggestions.
The first thing to do is to place the digital revolution within the context of the
communications and technological developments of the twentieth century. The
laptop on which I wrote this paper is after all the product of several important
developments, not just one. 

The second point is to avoid the ‘great inventor’ approach. This is not to
denigrate the work of any individual, only to remind you that the great inventions are
the result of process and synthesis, not lonely individuals working in attics or garages
isolated from the wider world.

A third point is to avoid seeing the history of the digital revolution as a single
straight, ever-rising line. It is one of the conceits of poor history of science and
technology to see this history as the continuing victory of reason, common sense,
and other virtues over the dark, not to say evil, past, moving mankind from the
valley of shadows to the sunny high plains of the modern world. This approach tends
to neglect the dead ends, the failures, the ill-conceived projects, that litter the history
of all aspects of human endeavor. Some of these dead ends and failures, however,
have seeds that subsequently sprout when new information or technology becomes
available. Awareness of these aspects of the digital revolution is especially important
because the span of time that constitutes a generation in this area is brief, as little as
six months. The scrapheap may well contain ideas and materials that, with a bit more
development, could make a significant contribution to the field or, when placed in a
new context, play a role not conceived of by their inventors.


A fourth consideration should be "what has been lost" as we move from one 
technology and mode of communication to another? When parchment replaced 
papyrus, a certain number of Latin texts were lost because they were not copied onto 
the new material or because only one or two copies were made and they were 
destroyed in the course of time. Later, as I pointed out, the use of the printing press
led to the loss of manuscripts as printed versions appeared. Not only were the
original texts lost, marginal notes that provided commentaries on the text were also
lost, thus leaving us in the dark about how medieval readers understood the
materials.

Something similar to these losses occurs as modern texts move to the
electronic world. For example, the authors of scholarly articles that are placed in
electronic data bases are asked to provide key words so that other researchers can
quickly locate and use them. A useful idea and it certainly will speed up data
searches. The drawback is that the reader of an article may actually find a significant
point that did not appear so important to the author who therefore never provided a
key word to guide the searcher to the article. Both cases eliminate the serendipities
that came with the earlier systems.

Another form of loss occurs when library catalogs are converted to digital form.
Recently, while using the old card catalog at the University of Pennsylvania Library
(because the computerized catalog was down), I was talking with one of the librarians
who happened to be doing the same thing. He pointed out that when the cards were
converted to a digital record, not all of the data on the cards was entered. This was
especially true of the very old cards that often had extensive cataloguing information
handwritten on them. Because the Penn Library has extensive holdings in the history
of the Near East and Egypt, these old cards often have notes written in Arabic, Greek,
and other languages that the computer program for the catalog did not accept, so this
data has not been entered in the new catalog. The result is that this very detailed
information will be lost to future generations of scholars. At the very least, users of
the new technologies ought to be made aware that the older technologies often
contain valuable information not found in the computer.

In the category of what has been lost, there are also the early devices: not
simply the vacuum-tube monster machines of forty years ago, but the Ataris and
Commodores. These and the operating systems, especially the proprietary ones, that ran
them have been largely lost. As the standardization of languages that followed the
invention of printing meant the end of the dialects and languages that did not attain printed
form, so the emergence of standard operating systems and the dominance of
particular programs for word processing, bookkeeping, and so on meant the end of
operating systems and programs that had achieved only limited market share.
Nevertheless, there remains both technical knowledge and information in these machines
and on these systems that may yet prove useful.




Let me conclude with a kind of cautionary tale. The questions this conference 
will be asking about the digital revolution have all been asked before. That is 
obvious. What is of special interest is that by 1520, two generations after the printing 
press appeared on the scene, Europeans were asking about where it came from, how 
it was developed, and what were its consequences. Even at that seemingly early 
date, the origins of printing were lost in the mists of time. Given the short length of a 
generation in the digital world, this conference is not being held any too soon. If the 
historical record is not prepared, then the story of the digital revolution will be sealed 
off from later generations as tightly as the fallout shelter that held the records of the 
Blessed Leibowitz and be as accurate as Brother Francis’s elegant reproduction of
the blueprint he discovered.




Documenting the History of the Digital Age: 


What Do We Know?


Vinton G. Cerf and John C. Klensin


MCI Communications Corporation 


Introduction 


This paper discusses in brief terms what we know about the history of the Digital Age and 
then explores how we know it and what impact the Internet, its World Wide Web, email, file 
repositories and other functions, may have on our knowledge of history and our ability to 
record, organize, and recall facts and events in that history. Any attempt to cover 
comprehensively what we know about the Digital Age would surely take many terabytes. At 
best, one might be able to try to select some key events in the history of the Digital Age and, 
even then, there are sure to be disputes over the choice of the events deemed to be seminal. 
Alternatively, one can consider the nature of what we know, without attempting to be
comprehensive and specific about its substance. This brief paper timidly takes the latter
approach.


Computer Evolution 


Strictly speaking, one might mark the beginning of the history of the Digital Age with the
invention of various calculating engines attributable to legendary names such as Pascal,
Leibniz and Babbage. Indeed, one might even choose to go back some 5,000 years to the
invention of the abacus. The stones or beads of this device might be said to represent the
eolithic period in the history of digital computing. Pascal and Leibniz invented more
advanced, but still mechanical, calculating engines in the 17th Century, as did Babbage in the
19th. If we adopt the term eolithic for this early mechanical work, we might then speak of the
work of Hollerith, Atanasoff, Von Neumann, Aiken, Eckert, Mauchly, Turing, Zuse, and
many others of the first half of the 20th Century as the paleolithic era. With those examples as
preamble, we might then mark the beginning of the neolithic with the invention of the silicon
transistor in 1947 by Bardeen, Brattain and Shockley; the integrated circuit by Jack Kilby
and Robert Noyce in 1958; and the Intel 4004 microprocessor on a chip by Marcian Hoff in
1969. The dramatic evolution in storage media and costs has been nearly as important as
those in processors. This story, too, starts with beads, but evolves along two paths: from tubes
and delay lines to the development of core memory and thence to high-density dynamic and
static solid state memories, and from wire boards and paper tape through to the high-density
rotating devices and very high capacity tapes of today. If we were still faced with the storage
costs and densities of the 1960s, we couldn’t have the web (or MSWord) today -- they are just
too demanding in terms of storage utilization.


To these early results, we can add the general evolution of large scale integration, reaching
now the middle years of VLSI with the Intel Pentium MMX, Motorola 60X series PowerPC 
chips, MIPS 10000, and the special purpose chip sets used in high end super computers such 
as the various machines of the CRAY line. Some evolutionary branches, such as the
once-promising Connection Machine series invented by Danny Hillis and built by Thinking
Machines, Inc., seem to have died out. Departmental computers, the so-called mini-machines, 
such as the Digital PDP-11 and VAX series have prospered, although they are being 
displaced by high end workstations that have evolved into servers while workstations are 
being driven from below by high-end personal computers. Every generation of machine 
seems to drive out previous, higher-end generations, although it appears that no class of the
basic taxonomy has been driven totally to extinction. Thus we still have supercomputers, 
mainframes, departmental servers, workstations, personal computers, palm tops and personal 
digital assistants. Certain species have definitely died, but classes seem to grow, at the lower 
end, as technology allows us to place processing where we wish it at lower and lower cost. 


But computers form only a part of the story. A similar evolution has taken place in other areas 
such as programming languages, operating system design, and computer networking. To 
these evolutionary vectors, we may add the general digitization of everything - voice 
telephony, audio recordings, video camcorders, clocks, and the controls of automobiles and 
an increasing number of household appliances. 


In the end, all of these manifestations of the Digital Age merely form the stage upon which a 
drama is playing out: the evolution of software applications. In all probability, there is no end 
to this play. On this stage, virtually anything that can be programmed is possible, and what
limits are there to the creativity of the human mind?


Programming Languages and Operating Systems 


The evolution of computers from discrete, reed relay and tube switches to transistors, to very
large scale integration microprocessors is plainly an important part of what we know. The
development of programming languages from assembly languages to FORTRAN, COBOL,
ALGOL, PL/I, PASCAL, SMALLTALK, C, C++, PERL and JAVA marks yet another
important evolutionary dimension. Operating systems started with the early batch, stand-alone
varieties, evolved into multi-tasking systems such as IBM’s OS/MFT, OS/MVT, and MVS,
UNIVAC’s EXECS, and into interactive time-sharing systems, such as MIT’s CTSS, BBN’s
TENEX, AT&T's UNIX and Digital’s TOPS-10 and later VMS for the VAX series. These,
along with Control Data’s COMPASS operating system, were representative of a class of
operating environments. The Multics operating system developed at MIT for a modified
GE-645 machine deserves special mention for its hardware-assisted security capabilities and
for being the first significant realization of a number of ideas that were strongly influential in
the design of subsequent, and even contemporary, systems. Time-shared systems were used
extensively in networked environments, such as the ARPANET. Workstations, from such
companies as SUN Microsystems, Hewlett-Packard and Silicon Graphics, typically ran some
variant of the UNIX operating system, as did many Digital systems, in addition to Digital’s VMS.


Oddly, with the arrival of personal computing in the late 1970s, users were returned to
systems that were closer to their ancient, stand-alone cousins. Apple’s pre-Macintosh
operating system essentially allowed the operation of one program at a time. The Macintosh
MAC OS allowed more than one program to be initiated, but typically only one would be
active at any one time. The same was true of Microsoft’s MS/DOS and Windows operating
systems. There were some concurrent I/O opportunities allowed, but it was not until UNIX
was ported to PCs and Microsoft produced its NT system that one had multiprocessing
capability on the desktop.


Networking and Protocols 


The networking dimension starts with simple, remote stations, such as the IBM Job Entry 
Stations containing card readers, card punches and printers for the remote submission of 
batch jobs; for example, the use of 1401s and their peripherals as job stream I/O devices for 
the 709/709x under FMS and later IBSYS. As interactive time-sharing was explored, remote
access on the Public Switched Telephone Network (PSTN) became possible at relatively low
speeds (110 - 300 bits per second). These systems were all terminal-to-host designs,
permitting a single system to serve multiple, remote users. In the mid-to-late 1960s, a new
networking concept was explored in the US through the Defense Advanced Research Projects
Agency’s ARPANET and in the UK at the National Physical Laboratory (NPL). Packet
switching was the term used to describe this idea, although it was not called that when it was
first documented in the early 1960s in reports from RAND and from MIT. The project at NPL
involved a single switch, and the originator of the project called the switched units packets.
In some ways, the NPL project might be considered the first local area network. 


The first ARPANET nodes were developed by Bolt Beranek and Newman (BBN) and 
installed late in 1969 and early 1970 at UCLA, Stanford Research Institute (SRI), University
of California at Santa Barbara (UCSB) and University of Utah. A major public demonstration
of the ARPANET was conducted in October 1972 during the first International Conference 
on Computer Communication (ICCC) in Washington, DC. At this same conference, the 
International Network Working Group (INWG) was formed, as an analog to the Network 
Working Group (NWG) which developed the first ARPANET computer communication 
protocols. By the early 1970s, many other investigations of packet switching were under way, 
notably the Experimental Packet Switching System developed by the British Post Office, the 
CYCLADES project of the French Institute for Research on Informatics and Automation, the 
Reseau Communication par Paquet (RCP) project of the French telephone service and the 
Canadian DATAPAC project. In the US, around 1972, BBN started one of the first successful 
packet switching services, TELENET, but used a set of protocols that were very different 
from those in use on the ARPANET (see Protocol Wars, below). 


Computer communication protocols emerged as the central theme of the ARPANET effort 
and were developed in the context of interconnected time-sharing systems. The basic 
interface between the ARPANET nodes (called Interface Message Processors, or IMPs) was 
defined in a report from BBN: BBN1822. It became the bible for interconnecting computers 
on the network. But software was needed to allow computers of vastly different speeds, word 
sizes and formats to interwork. Standards had to be set to achieve commonality. Originating 
with a small group of graduate students at UCLA, University of Utah, UCSB and researchers 
at SRI, the ARPA Network Working Group tackled the host protocol problem, starting
before the first equipment had been delivered. The result, motivated strongly by the 1972
ICCC demonstrations, was a collection of layered protocols, with BBN1822 at the bottom and
the Network Control Protocol (NCP) and Initial Connection Protocol (ICP) at the next higher
layer. Utilities and applications were layered above NCP and were developed by groups
across the country involved in the host protocol work. Examples of this layer include file
transfer (FTP) and remote terminal access (TELNET). Electronic mail was, at first, an
extension of FTP but later used its own Simple Mail Transfer Protocol (SMTP). Many other
protocols were developed, together with a number of applications, in the ensuing 20 years
before the ARPANET’s retirement in 1990.


The initial ARPANET operated on 50 Kb/s circuits provided by AT&T. The network started 
with 4 nodes but grew quickly to its planned 19 nodes, and then expanded as success in 
military, academic and research communities drove increased demand. 


Commercial networking began to emerge in this same time period (late 1960s and early 
1970s). Proprietary work at Digital Equipment Corporation on DECNET, at IBM on Bisynch 
and later, Systems Network Architecture (SNA) and standards efforts on X.25 at the 
Consultative Committee on International Telegraphy and Telephony (CCITT) of the 
International Telecommunications Union (ITU) marked the commercial networking sector. 
Work at Xerox Palo Alto Research Center on the Xerox Networking System (XNS), while not
commercially successful, was taken up by Novell and transformed into Novell’s Netware and 
IPX products which were very successful in the market until overtaken by newer, non- 
proprietary international standards. 


With the success of the ARPANET, ARPA began a series of projects to explore the use of 
packet switching in new media, specifically, shared mobile digital radio and shared satellite 
channels. These so-called multi-access channels had their origins in another
ARPA-sponsored project, the ALOHANET, at the University of Hawaii. The developers used old
taxi radios, outfitted with digital logic to packetize data and to send it in bursts over the air to
a central computer at the University of Hawaii in Honolulu. The central controller was called
the Menehune (which was the Hawaiian word for “imp”). Terminals with text to send would
transmit their bursts whenever they had data ready. If they heard no reply from the
Menehune, they would retransmit. The innovation was that all terminals used the same
inbound radio channel. Collisions would result in garbling and the Menehune performed
checksums to detect such damage. If a collision occurred, the Menehune would ignore the
data. The terminal, hearing no reply on a separate outbound channel from the Menehune, would
retransmit, but at a random time later, to avoid re-colliding. No wonder they called it the 
ALOHANET; it was rather relaxed about the way the terminals took turns. The inventor of 
the Ethernet tells of reading a report from this project in 1973, visiting the site, and returning 
to XEROX Palo Alto Research Center (PARC) to invent a new form of local area networking 
using ALOHANET multi-access protocols on a coaxial cable, adding a carrier sensing 
collision detection mechanism which minimized the effects of colliding packets. 
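

The retransmission discipline described above can be sketched in a few lines of Python. This is
a toy model under stated assumptions, not the ALOHANET implementation: it divides time into
equal slots (the original system let terminals transmit at any instant), it gives each terminal
a single frame to send, and the names simulate_aloha and max_backoff are invented here for
illustration.

```python
import random


def simulate_aloha(num_terminals, max_backoff=8, seed=1, max_slots=10_000):
    """Toy slotted model of ALOHANET-style retransmission.

    Each terminal has one frame to send. A frame gets through only if it is
    the sole transmission in its time slot; otherwise every frame in that
    slot is garbled and each sender retries after a random delay.
    """
    rng = random.Random(seed)
    # Slot in which each terminal will next transmit. All start together,
    # which forces an initial collision when there is more than one terminal.
    next_attempt = {t: 0 for t in range(num_terminals)}
    delivered_in = {}
    attempts = 0

    for slot in range(max_slots):
        senders = [t for t, when in next_attempt.items() if when == slot]
        attempts += len(senders)
        if len(senders) == 1:
            # Only one burst on the channel: the checksum is good, the
            # central station replies, and the terminal is done.
            t = senders[0]
            delivered_in[t] = slot
            del next_attempt[t]
        else:
            # Collision (or idle slot): no reply is heard, so each sender
            # picks a random later slot in which to try again.
            for t in senders:
                next_attempt[t] = slot + 1 + rng.randrange(max_backoff)
        if not next_attempt:
            break

    return delivered_in, attempts


if __name__ == "__main__":
    delivered, attempts = simulate_aloha(5)
    print("delivery slot per terminal:", delivered)
    print("total transmissions:", attempts)
```

The random delay is what breaks the symmetry: terminals that collided once are unlikely to pick
the same retry slot again, so every frame eventually gets a slot to itself.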


The Packet Radio and Packet Satellite programs extended multi-access technology in two 
ways. In the Packet Radio case, multi-hop services using a single, shared radio channel were 
offered in a dynamically-changing topology. In the case of Packet Satellite, two channels 
were used, one for uplink and one for downlink, so that the terminals could simultaneously
send and receive. Packet Satellite propagation delays were relatively long since it takes at 
least a quarter second for a transmission to propagate from the ground to a synchronous 
satellite and back to Earth again. These two research networks explored dynamic topology 
packet switched networking and long delay networking in multi-access environments. Not 
surprisingly, these networks were very different from the ARPANET and the Ethernet. Their 
packet sizes were different; their error rates and speeds were different. ARPANET ran at 50 
Kb/s using 128 byte packets. The Packet Radio system used 100 Kb/s and 400 Kb/s modes 
with packet sizes on the order of 255 bytes. The Packet Satellite system used 64 Kb/s 
channels and packet sizes up to about 512 bytes. 
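

To make these figures concrete, the short calculation below estimates the satellite propagation
delay and the time each network needs to clock one packet onto its channel. It is a
back-of-the-envelope sketch with assumptions that are not in the text: a geostationary altitude
of 35,786 km, a ground station directly beneath the satellite, and 1 Kb/s taken as 1,000 bits
per second. The function names are invented for illustration.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in vacuum
GEO_ALTITUDE_KM = 35_786           # assumed geostationary altitude


def satellite_round_trip_s(altitude_km=GEO_ALTITUDE_KM):
    """Ground -> satellite -> ground propagation time, in seconds, for a
    station directly beneath the satellite (the shortest possible path)."""
    return 2 * altitude_km / SPEED_OF_LIGHT_KM_S


def serialization_ms(packet_bytes, rate_kbps):
    """Time to clock one packet onto a link, in milliseconds.

    Bits divided by kilobits-per-second gives milliseconds directly.
    """
    return packet_bytes * 8 / rate_kbps


if __name__ == "__main__":
    print(f"satellite round trip: {satellite_round_trip_s():.3f} s")
    for name, size, rate in [
        ("ARPANET", 128, 50),
        ("Packet Radio (slow mode)", 255, 100),
        ("Packet Radio (fast mode)", 255, 400),
        ("Packet Satellite", 512, 64),
    ]:
        print(f"{name}: {serialization_ms(size, rate):.2f} ms per packet")
```

Under these assumptions the shortest satellite path comes to roughly 0.24 seconds, in line with
the quarter-second figure, and it is several times longer than the tens of milliseconds any of
the three networks needs to transmit a full-sized packet.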


For lack of time and space, I have left out the important story of ring-based networks such as
the Distributed Computing System (DCS) from University of California, Irvine; the PRONET
from Proteon, which grew out of work at MIT that was itself based on DCS; and Fiber
Distributed Data Interface (FDDI) rings. And I have left out the fascinating histories of
Asynchronous Transfer Mode (ATM) switching, Frame Relay, and the recent emergence of IP
switching. A comprehensive discussion would surely include these and many other important
networking developments that took place in the last quarter of the 20th Century, and the last
minutes, figuratively speaking, of the Second Millennium.


INTERNET 


At the time that the Packet Radio and Packet Satellite experiments were underway, it was
logical to ask how these different packet networks could be interconnected. Out of this
question came the ARPA internetting project. The objective was to develop a set of
protocols which would permit transparent networking across a collection of inter-linked 
packet networks. A number of principles guided the design: 


1. Each network was to remain independent and unmodified.

2. The interconnecting black boxes (later called gateways and, still later, routers) should
retain as little state information as possible.

3. The networks would not be relied upon for reliability. This had to be achieved on an
end-to-end basis.

4. There would be no central, global control.


The focus of the research for what is now called the Internet was on the protocols to be used 
by the hosts on an end-to-end basis and the protocols to be used in the routers to interlink the 
various networks to each other. It was recognized almost immediately that a new set of end- 
to-end protocols to replace the older ARPANET Network Control Protocol (NCP) and Initial 
Connection Protocol (ICP) would be needed to achieve better performance and end-to-end 
reliability. The older protocols had depended on the reliability of the ARPANET and its 
ability to sequence traffic to achieve reliable, ordered communication. Moreover, it was clear 
that the gateways (routers) would need some way to communicate topological information to 
route packets from their sources to their destinations. 

In the initial implementation, a single Transmission Control Protocol (TCP) was specified 
which provided reliable, sequenced, flow-controlled and error-free communication on an end- 
to-end basis. The gateways were aware of the format of TCP packets and used their headers 
to make routing decisions. After two iterations on the design, and some extensive testing, it 
was concluded that to support real-time applications that did not require sequencing as much 
as low delay, the TCP protocol would be split into an Internet Protocol (IP) and a residual 
TCP which only concerned itself with end-to-end functions (flow control, sequencing, error 
recovery and multiplexing). 
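

The end-to-end idea can be illustrated with a small simulation. The sketch below is a toy, not
the TCP algorithm: it has no windows, timers or checksums, and the function names, the loss
rate and the segment size are assumptions made for illustration. The "network" may drop
datagrams and deliver the rest in any order, and only the two hosts keep state: the sender
numbers its segments and retransmits whatever has not been acknowledged, while the receiver
uses the sequence numbers to discard duplicates and restore order.

```python
import random


def unreliable_network(packets, rng, loss=0.3):
    """Datagram service: may drop packets and deliver the rest out of order."""
    survivors = [p for p in packets if rng.random() >= loss]
    rng.shuffle(survivors)
    return survivors


def transfer(message, chunk=4, seed=7, max_rounds=1000):
    """Send `message` end to end; only the two hosts keep any state."""
    rng = random.Random(seed)
    # Sender: number each segment so the receiver can order and de-duplicate.
    segments = {seq: message[i:i + chunk]
                for seq, i in enumerate(range(0, len(message), chunk))}
    unacked = set(segments)
    received = {}  # receiver's reassembly buffer, keyed by sequence number
    rounds = 0
    while unacked and rounds < max_rounds:
        rounds += 1
        # Sender (re)transmits everything not yet acknowledged.
        outbound = [(seq, segments[seq]) for seq in sorted(unacked)]
        for seq, data in unreliable_network(outbound, rng):
            received[seq] = data  # a duplicate simply overwrites itself
        # Receiver acknowledges what it holds; acknowledgments can be lost too.
        acks = unreliable_network(sorted(received), rng)
        unacked -= set(acks)
    if unacked:
        raise RuntimeError("gave up before every segment was acknowledged")
    return "".join(received[seq] for seq in sorted(received)), rounds


if __name__ == "__main__":
    text, rounds = transfer("packets may be lost, duplicated or reordered")
    print(text, f"(delivered intact after {rounds} rounds)")
```

Nothing in unreliable_network remembers anything between calls, which mirrors the design goal
that the gateways retain as little state as possible.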


The idea of layered protocols was lifted from the ARPANET experience and used 
extensively in the Internet design. The various utility and application protocols for remote 
access (TELNET), file transfer (FTP) and electronic messaging (SMTP) were taken into the 
Internet system almost without modification. 


After some four iterations on design, implementation and testing, with an international 
community of interest (using the INWG formed at ICCC in 1972 and specific, ARPA-funded 
research groups in the US and Europe), the Internet Protocols stabilized in their fourth version 
by about 1978. Documentation, testing and wider implementation occupied the next 4 years 
until January 1983, when it was decreed that all ARPANET hosts would migrate to the new
TCP/IP protocol suite. For many, January 1983 marks the birth of the operational Internet. 


By the mid-1980s, several important events influenced the future course of the Internet. 
Commercial routers were offered by companies such as Cisco Systems and Proteon. The 
National Science Foundation (NSF) initiated its NSFNET project which led to the creation of
a number of intermediate level networks to provide connectivity between university and
research sites and the NSFNET backbone. And interest in the Internet began to emerge more
fully in the European community as well as the United States with the commercial availability
of local area networks, UNIX systems with TCP/IP built-in, and routers from commercial
sources.


By the late 1980s, US Government policy was shifting to allow limited commercial access to 
the Internet, particularly with its decision to allow a commercial email service, MCI Mail, to 
interconnect with the Internet in 1989. The Acceptable Use Policy (AUP) for the Internet
continued to evolve until the retirement of the NSFNET in April, 1995, at which point all 
restrictions were removed on the general use of the Internet. As early as 1990/1991, 
commercial Internet services were emerging, most notably Alternet operated by UUNET and 
PSINET, both of which began as non-profit organizations serving the academic and research
communities. The NSFNET AUP restrictions drove commercial service providers to find
alternatives to interconnection via NSFNET and a number of them banded together to form a
Commercial Internet eXchange (CIX) to avoid usage limitations. CIX became a kind of
model for the later NSF-sponsored Network Access Points (NAPs) which served as
unrestricted packet exchange points in the post-NSFNET environment.


Other networking activities of this era deserve more than their brief mention here, notably
USENET, based on the UNIX-to-UNIX Copy Program (UUCP); BITNET, developed
initially to link academic computing centers using IBM mainframes; and FIDONET, developed
to interlink collections of personal computers through dial-up, store-and-forward protocols.


Protocol Wars 


It would be remiss not to mention, at least, the lengthy period from about 1978 to 1992,
during which considerable debate surrounded two rival non-proprietary standards for
computer communication: the TCP/IP protocols developed for the Internet and the so-called
Open Systems Interconnection (OSI) standards developed in the International Organization
for Standardization (ISO) and the CCITT. The original documentation for the OSI reference
model emerged in an architecture paper in 1978. There ensued many years of detailed 
specification work on a 7-layer architecture, along with a unique vocabulary of terms to 
describe the functionality of the proposed networking protocol suite. 


The origins of this debate were found in the X.25 virtual circuit standards developed in the 
CCITT starting in 1973 and reaching first standardization in 1975. In this model, reliability 
and sequencing would be achieved in the network and not in the end machines. At the time, 
this was an understandable commercial motivation because most computer networking was 
based on dedicated, leased circuits. The commercial providers of computer communication 
network service wanted to emulate as closely as possible the existing circuit services and, 
thus, developed a virtual circuit design. The Internet design, in contrast, assumed that the 
collection of network technologies making up the Internet could not be relied upon to provide 
sequenced, reliable delivery and opted for a datagram mode of operation. All reliability and 
sequencing and error correction would take place on an end-to-end basis. The 
datagram/virtual circuit wars went on for years. X.25 became a very successful service but 
didn’t work well in local area network environments. As LANs became more and more 
prevalent, the Internet datagram mode of operation was increasingly attractive. An effort was 
made to include a datagram mode in X.25 but was abandoned after it appeared to be far too
complex to be implemented at reasonable cost.


The OSI protocols were initially based on the virtual circuit assumptions of X.25 and built on
top of them. An attempt was made to integrate a datagram mode of operation (the so-called
connectionless mode) but this came fairly late in the evolution of the OSI protocols and was
never widely implemented. The connectionless network protocol (CLNP) was actually
implemented in the NSFNET and used in very limited quantities, mostly during 
experimentation with higher-level OSI protocol implementations. In practice, there was only 
modest implementation of the OSI protocol suite, but this did not stop many governments, 
including the United States, from officially adopting the OSI protocol suite. Despite the 
formal adoption, however, the commercial world continued to build and sell Internet-based 
products until, finally, around 1992, it became clear that little of the OSI effort had reached 
commercial viability. Among the long list of OSI protocols specified, those involving
messaging seemed to achieve the most penetration in the marketplace. The X.400 messaging 
standards achieved prominence and were used to interconnect public electronic mail systems. 
The X.500 directory standards had somewhat less success but exist in a number of products 
and services. They have, however, provided a model for other systems, including many 
LAN-based mail systems that use neither X.400 nor Internet technology, but that have 
borrowed heavily from X.400 terminology and basic design ideas. Interworking of X.400 
and Internet-based electronic messaging systems, together with proprietary LAN-based email 
services is now commonplace, though by no means perfect. The semantics of the various 
systems continue to interfere with satisfactory end-to-end transparency, and much effort 
continues to be expended on achieving better results. 


In the meantime, the Internet standards for messaging, SMTP (RFC821) and message body
formats (RFC822), together with the newer Multipurpose Internet Mail Extensions
(MIME) and the negotiated SMTP extension system, form the basis for an increasingly
widespread messaging infrastructure. Proprietary protocols in LAN environments continue to 
be a major factor in corporate use of messaging, but these are increasingly forced by necessity 
to interwork with Internet Standards for the benefit of users. 


It is tempting to continue to try to document more of the Digital Age in this paper, but the 
task is boundless and there are a few other topics that deserve attention. 


How Do We Know the History of the Digital Age?


In some sense, we only know what people take the time to organize and record. We often fail to
record what later turns out to be important. Few, if any, remember the details of the first 
transmissions on the ARPANET. Should we archive everything in the hope that, later, we will 
have important facts available when they prove to be important (or at least interesting to 
know)? Many historians would appreciate a full record of everything that happens, for 
reference purposes. Software people never got into the habit of using lab notebooks on a 
regular basis, unlike their hardware design counterparts. As a consequence, much is not known
in any shared form of the blind alleys and blinding inspirations of the software age. Popular 
accounts get at part of it, but these are often based on recollection rather than documentation. 
Even remembering WHEN something happens is difficult if not written down in an archival 
form. 


The early history of the ARPANET and Internet host level protocol design is captured in an
extraordinary series of documents labelled Request for Comment (RFC). The first RFC was
issued in the spring of 1969, starting the series. The series has been edited by one person who,
to this day, continues to make editing this series a part of his career. These have become the
official documents of the Internet Standards community, though they began as a series of
informal communications among the developers of the ARPANET and, later, Internet
Protocols. A comparison of the earliest RFCs and the most recent reflects the radically
different treatment these documents got in 1970 versus 1997. Originally issued on paper and
distributed and archived by Stanford Research Institute, these documents are now found
online in many archives of the Internet. Interestingly, the Internet project started a series of
notes, Internet Experiment Notes (IENs), in parallel with the RFCs. At the time, the RFCs
were associated with the continuing development of protocols on the ARPANET. The Internet
work was experimental and it wasn’t clear how it would turn out. It seemed, at the time,
inappropriate to distract the ARPANET community with the uncertainties of the Internet
research program. When it became apparent that the Internet protocols really should replace
the older ARPANET versions, the IEN series was ended and new Internet documents became
a part of the RFC series. From this experience, we learned that it was preferable to maintain
a single series of notes with common indices and the like, than to have multiple note
series and the corresponding problem of deciding where to put things or where to find things
of interest.


One important aspect of the early RFC series is that they conveyed the real debates and
discussions of design choices and tradeoffs. They formed a record of the primary concepts
explored in the new territory of packet switching. As email became an increasingly
widespread and attractive medium of exchange, the RFCs were issued only to capture more
polished results. Electronic mail archives, however, were established for a number of email
distribution lists, such as the Header-People list whose members were interested in the
evolution of electronic mail standards. Another electronic means of mediating discussion
emerged from the so-called conferencing programs such as PLANET from the Institute for
the Future. In this system, discussion was conducted in text form and each entry was
cataloged and archived. An editor/moderator had the ability to edit the archive, and any
participant had access to the historical record. Various ways of indexing the corpus of the
archive were provided to help readers to find what they were interested in. Conferencing
software has evolved but has retained much of the functionality of the early systems.


Another format for information dissemination grew out of the Unix community. An informal
network, USENET, of UNIX-based machines grew up during the late 1970s and early 1980s.
The so-called NetNews or NewsGroups system was built on the Unix-to-Unix Copy Program
(UUCP). Contributions from all interested parties were merged into feeds
which were distributed throughout the USENET. Special NewsGroup reading programs (we
would call them clients today) made it easy to sort through and select from the multi-topic,
aggregate newsfeed. A major technical and sociological story in itself, the history of
NewsGroups and USENET deserves a great deal more space than I am able to give it in this
paper.


None of the methods mentioned so far is perfect and each has different strengths. General
email has the problem that it arrives in an undifferentiated avalanche of content that can be
sorted, typically, by subject line or author or date/time of submission. Moreover, a series of
messages bearing the same topic (replies on replies, etc.) often drifts away from the original
topic, interfering with accurate archiving and indexing. NewsGroups have a similar off-topic
drift problem, but if moderated, the moderator can exercise discipline, as can the moderator of
a conferencing system or a moderated mailing list. New NewsGroup topics can be formed
and new conference topics can be created. However, one is then confronted with the problem
of choosing in which group to move forward a discussion, leading to cross-posting and a
general blurring of the topicality of any particular NewsGroup or discussion group. A
number of efforts, notably those associated with Jacob Palme in Sweden, have attempted to
merge the models of conferencing systems, newsgroups, and mailing lists into a single user
environment. None of those efforts has been very successful in the marketplace, although
attempts continue in both protocols and commercial products.


Attempts to organize email into topic folders usually suffer from the same disease as
NewsGroups and conference topics. I find myself filing messages in multiple folders, largely
in the hope that I will remember one of the folder/topic names when it comes time to retrieve
the messages of interest.


ARCHIVES OF INFORMATION 


As networking has spread, so have a variety of technologies for encoding, formatting,
distributing and presenting information. File archives, containing programs and text files and
data, were among the first to appear, thanks to the invention of anonymous FTP early in the
ARPANET history. These repositories could be accessed by anyone logging in as user 
anonymous, password guest. It was not until the early 1990s, however, that the tens of 
thousands of file archives were indexed by a program called archie developed at McGill 
University. The system indexed roughly 80,000 FTP repositories to simplify network-wide 
searches for files of particular names. This idea was soon followed by Gopher, developed at 
the University of Minnesota, which created a distributed menu system. Each menu screen 
could come from a different processor on the Internet. Gopher eliminated the need to know 
much, if anything, about host names, IP addresses and so on. In the same general time period,
the Wide Area Information Servers (WAIS) system was invented to index the full content of
distributed archives of textual information. The World Wide Web (WWW) brought another 
dimension to the distributed archiving of information, this time in a multimedia setting. Web 
pages containing images, multifont text, audio and video clips and even executable programs 
added a richness to the content of the Internet that had not been seen before, except in 
independent pieces (separate files, etc.). 


It is the WWW which motivates some of the questions surrounding the capture of the history
of the Digital Age. The other archives and media still persist, as does electronic mail which is
itself beginning to take on Web-like characteristics, thanks to improvements such as the
Multipurpose Internet Mail Extensions (MIME) standards developed by the Internet
Engineering Task Force (IETF). The World Wide Web has unleashed an avalanche of
multimedia information flowing into servers around the world. Its standards, such as
the HyperText Transfer Protocol (HTTP) and HyperText Markup Language (HTML),
have become a part of the current idiom. The Uniform Resource Locators (URLs) that point
to the locations of Web pages on the Internet (using Domain Names such as www.mci.com)
are frequently displayed in print and television advertising, referenced on the radio and
mentioned in normal, everyday discourse.


World Wide Web (WWW) 


There are an estimated 500,000 WWW servers on the public Internet, among the estimated
18-20 million computers on the system (not including all the dial-up desktops and laptops that
aren’t usually counted in surveys). Corporate private intranets probably account for 
anywhere from 4-10 times as many web servers as there are in the public Internet. As an 
increasing amount of business data and correspondence find their way into Internet-based 
media, the question of archiving, indexing and retrieving such information becomes 
increasingly important and relevant. 


It is possible to print many of the pages found in the WWW. Color is typically needed. If
printing were a sufficient solution, one might be tempted to adapt the archiving and retrieving
methods of the print medium to address the problem of archiving the Web. But printing is not
sufficient. Nor is recording of audio and video clips onto various magnetic media. The WWW
pages contain a variety of media, encoded for computer interpretation and presentation. These
various page components can be rendered in various ways, depending on the output medium.


Some portions can, indeed, be printed. But some components require interpretation as
sound, as moving images and even as programs. Moreover, keeping the information in its
computer-interpretable form allows for future, computer-assisted searches to be made.
Plainly, this is a significant challenge - the cataloging, indexing and archiving of a
fundamentally new medium. Another aspect that makes printing inadequate -- and makes
historical/documentation work very hard -- is the essentially dynamic nature of some of the
material. Many pages are useful because they change, sometimes rapidly, over time, and
printing the material gives a snapshot rather than reflecting either the newest versions or the
process/sequence of changes.


The problems are not easily bounded. New encodings are invented regularly and must
become a part of the archiving and indexing milieu. Since much of the interpretation is based
on software, it is necessary to retain the software that can interpret the encoded content or to
periodically recode the information so it can continue to be interpreted by software that is
contemporaneously available. One of the most difficult aspects of archiving of
computer-based content has proven to be the short lifetimes of either the storage media themselves
or the devices that could read and write them. Just try finding software and/or hardware to
read an 8” floppy disk so popular with early word processing systems or a seven-track tape
drive. A regular program of copying old media content onto new media might help to
combat some of these problems. Maintaining the utility of encoded information is going to be
a major challenge of the Digital Age. The recent entry of the JAVA programming language
into the Internet may well offer at least one common platform on which to build stability,
even as new hardware platforms emerge from the rapid pace of semiconductor
microprocessor evolution. JAVA interpreters can be built on virtually any computing
platform, allowing JAVA-dependent content to be accessible on newer processing engines.
But this also encourages more dynamic content and pages built up from multiple-origin
sources, which may be much harder to capture.
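

The regular copying program mentioned above has one step that is easy to show in code:
confirming that the bits on the new medium match the bits on the old one. The sketch below is
one possible approach, not an established archival procedure. The function names are invented
for illustration and the use of a SHA-256 digest as the fixity check is an assumption. It
addresses only bit-level survival; it does nothing about the harder problem of keeping the
content interpretable.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, block_size=1 << 20):
    """Digest a file in fixed-size blocks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def migrate(source, destination):
    """Copy a file to new storage and confirm the bits survived the move."""
    before = sha256_of(source)
    shutil.copyfile(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise OSError(f"copy of {source} does not match the original")
    return after  # record this digest so later copies can be checked against it


def demo():
    """Round-trip a small file through migrate() in a scratch directory."""
    with tempfile.TemporaryDirectory() as scratch:
        old = Path(scratch) / "old_medium.dat"
        new = Path(scratch) / "new_medium.dat"
        old.write_bytes(b"Documenting the Digital Age")
        return migrate(old, new) == hashlib.sha256(old.read_bytes()).hexdigest()


if __name__ == "__main__":
    print("copy verified:", demo())
```

Keeping the returned digest alongside the archived file lets each later migration be verified
against the original rather than against the most recent copy.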


An alternative thought is to produce interpreters of older machines on the newer ones so that 
old software can still be run. Even if this is only a transitional step towards maintaining 
accessibility, it has been a common technique for preserving older but valuable information. 


Summary 


We know a good deal about the history of the Digital Age but our knowledge is becoming 
less complete as this age matures. We are not keeping up with the volume of material that it 
might be wise to archive and index. We do not have programs in place for capturing the 
often-volatile content of the World Wide Web and its Corporate counterparts (Company-wide 
web?). We have few tools for archiving and indexing personal, multimedia content, nor 
national programs for archiving of ephemeral information which could be invaluable to the 
historians of a future era. It seems inescapable that there is a wide range of information
archiving and indexing products and services which would be of interest in all sectors 
(personal, business, academic and government) where Internet technology has taken root. 


That this will be a complex undertaking is surely an understatement. The preservation of 
privacy and corporate confidentiality will compete with future interest in this information. 
Intellectual property rights treatments must be found which are, at least, compatible with the 
needs of preservation. Deciding what to preserve and who has the responsibility for it is 
another complex question. The developers of this technology and its users have a mutual
responsibility to develop feasible guidelines and techniques for the capture, indexing and 
preservation of valuable information. There is much work to be done, much still to be 
discovered and applied. We cannot avoid the avalanche, but only try to speed ahead of it. 


----------


i I was tempted to say “volumes” but that seemed archaic, given the topic!


ii “Computers,” Encyclopaedia Britannica, Encyclopaedia Britannica, Inc., Fifteenth Edition, 1987, 
Macropaedia. Vol. 16, p. 678 ff.


iii A History of Computing in the Twentieth Century, N. Metropolis, J. Howlett and Gian-Carlo Rota
(eds.), Academic Press, Inc., Harcourt Brace Jovanovich, 1980. This is an extraordinary volume
prepared from a collection of essays presented at the International Research Conference on the History
of Computing held at Los Alamos Scientific Laboratory, 10-15 June, 1976. Legendary names abound in
the list of authors and in the references. The book includes an extraordinary chapter by Brian Randell
about the Colossus computers built in England during the Second World War to break the German
cryptographic codes.


iv “Transistor,” Encyclopaedia Britannica, Encyclopaedia Britannica, Inc., Fifteenth Edition, 1987, 
Micropaedia. Vol. 11, p. 897. 


v “Integrated Circuit,” Encyclopaedia Britannica, Encyclopaedia Britannica, Inc., Fifteenth Edition,
1987, Micropaedia. Vol. 6, p. 337. 


vi http://webnuc.nuce.psu.edu/~jhm/20 I /lectures/people.html. Maintained by John Mahaffy : 
jhm@cac.psu.edu 


vii See http://www.isoc.org/internet-history 


Internet History Conference - Position Paper 
Nathan Myhrvold 


The Internet is rapidly becoming a key method for communication and the dissemination
of documents and ideas. There are many reasons why this is so:


* Email combines the virtually instant transmission of information found in electronic
communication means such as the telephone or television with the asynchronous delivery
found in a written letter. It is easy to scale the delivery from a single recipient to hundreds
or even more through mailing lists. This combination is very attractive, and for many
users has already displaced both the telephone and paper based letters.


* Web pages provide a very scalable method of publishing information to a wide audience.
A web page can economically reach a very small audience, because it is inexpensive to
create. Yet, should the site attract a large audience, it is easy and cheap to scale the
delivery up to that point.


* Bulletin boards and news groups provide forums for discussion and interaction between
multiple participants. Prior electronic communications media, such as the telephone,
excelled at one-to-one communication; radio and television are best at one-to-many
(where many is typically measured in millions) communication. None of these prior
electronic forms could efficiently handle interaction between small ad hoc groups of
people - such as found in conferences, meetings, or less formal groups of people.


* Chat services provide direct, synchronous communications, which include a complete
log.


* Indexing and cataloging services allow easy access to information across the entire net.


* All of these aspects of the Internet are remarkably cheap, both in the absolute, and in
comparison with other media.


These properties make the Internet a tremendous information resource. Technological
trends suggest that the Internet will get a variety of new capabilities over time, such as the ability
to easily deal with high quality video. The Internet is about all you could ask of an information
resource.


Except one thing: the Internet is not naturally archival.


Paper based publishing or information storage creates, as a necessary by-product, an
archival record that can last for hundreds of years, if stored properly. In addition to the physical
properties of the media, there is cultural support for archiving the material. A large
infrastructure - the library system - has been set up to maintain an archival record of most paper
based publishing (books, magazines and newspapers), and make it accessible to nearly anyone.
Indeed, there are legal requirements to archive a copy of most kinds of printed material in a library
(such as the Library of Congress or British Museum Library) in order to fully enjoy the benefits of
copyright protection. Many private documents that are not published are nevertheless stored
as a matter of course. Examples include business records, government records, private
correspondence etc. Some of these have legal requirements, others are simply stored out of
tradition.


In contrast to this situation, Internet based information creates no archival record
as a natural by-product. Virtually all web sites and bulletin boards reside on hard disks - which
are read-write storage devices. These are routinely overwritten and reused. In addition,
technological improvement leads to the rapid obsolescence of the computers and hard disks on which
data on the net is stored. Older material is not always moved over to new machines when this
occurs.


It is technologically feasible to create archival forms of the data on the Internet, but only if 
you deliberately set out to do so. There is no cultural or societal imperative to create archival 
material in the Internet community, whether in libraries or privately. In fact, just the opposite 
attitude exists. The rapid pace of technological change has created a cultural expectation of rapid 
update to the "latest and greatest" version of any sort of information. 


As a result, the early days of the Internet may well appear to future historians as a
pre-literate society, for none of it will have remained for them to study. This lack of material will
impede both the study of the Internet itself and the study of any societal or cultural
trends that rely on the Internet. Today this is a substantial list, but within the next several
decades, it will encompass nearly any aspect of our lives. The paper based historical record will
be increasingly insufficient, because it will capture a declining portion of what
actually takes place.


The alternative is to archive the Internet. We can create archival copies of every publicly 
accessible site on the net. Imagine snapshots of the net stored as a sort of time capsule. A 
future historian could browse or search the entire Internet, as of 10 AM, February 11, 1997.
Every Usenet newsgroup and every web site would be present.


Such an archive might seem prohibitively expensive - equal in cost to the entire net - but 
using digital tape it can be captured very cheaply. In that form it would be difficult to access, but 
at least it would be stored for the future. As computer storage technology continues to 
improve, it will become increasingly easy to put the snapshot on line. Within 10 years, the 
snapshot of the net as of February 11, 1997 will be quite small compared to the Internet circa 
2007 - probably just a few percent of it. 


Future historians will use advanced information retrieval software to explore, browse and
correlate information in such a snapshot. They would be able to see the Internet as somebody in
1997 would, or if they chose other views, be able to perform all sorts of analytical studies
treating the entire snapshot as a corpus. Want to find out who started a particular idea, rumor or
trend? Search for the first occurrence, then find all related instances. How prevalent was
discussion of the presidential election on the net in 2000 versus 1996? Run cross
comparisons by searching and cataloging sites from the snapshots of both years. Traditional historical
analysis will be possible, but so will many other new methodologies that are enabled by the
information retrieval software.
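

The two questions posed above (who said it first, and how prevalent was a topic in one year
versus another) can be sketched in a few lines of Python. This is only an illustration under
stated assumptions: the Page record, the function names, and the sample pages are hypothetical,
and plain substring matching stands in for the far more capable retrieval software the paper
anticipates.

```python
from datetime import date
from typing import List, NamedTuple, Optional


class Page(NamedTuple):
    """One archived page: when it was captured, where, and its text (hypothetical record)."""
    captured: date
    url: str
    text: str


def first_occurrence(corpus: List[Page], phrase: str) -> Optional[Page]:
    """Return the earliest captured page that mentions the phrase, or None."""
    needle = phrase.lower()
    matches = [p for p in corpus if needle in p.text.lower()]
    return min(matches, key=lambda p: p.captured) if matches else None


def prevalence(snapshot: List[Page], phrase: str) -> float:
    """Fraction of pages in one snapshot that mention the phrase."""
    if not snapshot:
        return 0.0
    needle = phrase.lower()
    hits = sum(1 for p in snapshot if needle in p.text.lower())
    return hits / len(snapshot)


# Hypothetical miniature snapshots of two election years.
snapshot_1996 = [
    Page(date(1996, 10, 1), "news/a", "The presidential election is near"),
    Page(date(1996, 10, 2), "hobby/b", "Model trains"),
]
snapshot_2000 = [
    Page(date(2000, 10, 1), "news/c", "Presidential election coverage"),
    Page(date(2000, 10, 3), "forum/d", "My view of the presidential election"),
    Page(date(2000, 10, 4), "shop/e", "Books for sale"),
]

topic = "presidential election"
earliest = first_occurrence(snapshot_1996 + snapshot_2000, topic)
share_1996 = prevalence(snapshot_1996, topic)
share_2000 = prevalence(snapshot_2000, topic)
```

Treating each snapshot as a simple list keeps the comparison between years explicit; a real
archive would need an index rather than a linear scan.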


A series of such snapshots will in the long run be an invaluable record for future historians.
Without deliberate archiving, the Internet information will be lost, and the historical record of the
society based on it will therefore be fragmented and incomplete. However, once you do
archive the net, the situation reverses because you get a historical record that is larger and
more detailed than any that has been available.


Fields other than history would find the archive useful as well. Linguists and 
lexicographers would find the corpus invaluable as a study of the evolution of language. 
Economists will be able to track the proliferation of web based commerce. Social scientists will 
get a vast store of information on popular culture. People in all walks of life use public libraries 
to access information in old newspapers, magazines and books - whether for school projects, 
nostalgia or whatever reason they desire. If people have a desire to access the past today - why 
won't they have similar desires in the future? 


So, my proposal is, quite simply, to archive the entire Internet. Digital tape and other off 
line means seem to be the cheapest means to store such an archive until a combination of funding 
and technology improvement make it feasible to place the archive online. Hopefully this can be 
as soon as possible, but given the long term nature of the project, priority must be given to 
capturing the data and saving it for posterity. 


The archiving can be done periodically, as a series of snapshots taken at a specific time (or 
over some small interval). An improvement on this scheme would be to supplement the periodic 
snapshots of the entire net with incremental changes taken over an even more frequent period. 
Existing web crawlers used by search and indexing sites are capable of capturing this information
today.
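

As a minimal sketch of the snapshot-plus-increments scheme described above, the Python below
records a fingerprint for every page in a full snapshot and then stores only the pages that are
new or changed since that snapshot, noting any that have disappeared. The function names, the
in-memory dictionaries standing in for crawled pages, and the choice of SHA-256 as the
fingerprint are all illustrative assumptions, not part of the proposal itself; a real crawler
would fetch pages over the network and write to tape or disk.

```python
import hashlib
from typing import Dict, List, Tuple


def fingerprint(content: bytes) -> str:
    """Return a stable digest of a page body (SHA-256 is an illustrative choice)."""
    return hashlib.sha256(content).hexdigest()


def take_snapshot(pages: Dict[str, bytes]) -> Dict[str, str]:
    """Full snapshot index: map every URL to the fingerprint of its content."""
    return {url: fingerprint(body) for url, body in pages.items()}


def incremental_changes(
    previous: Dict[str, str], current_pages: Dict[str, bytes]
) -> Tuple[Dict[str, bytes], List[str]]:
    """Compare current pages with the previous snapshot index.

    Returns the pages that are new or changed (which must be stored) and
    the URLs that have disappeared (which need only be noted).
    """
    to_store = {
        url: body
        for url, body in current_pages.items()
        if previous.get(url) != fingerprint(body)
    }
    removed = sorted(url for url in previous if url not in current_pages)
    return to_store, removed


# Hypothetical example: two crawls of a very small "net".
february = {"site-a/index": b"hello", "site-b/news": b"old story"}
march = {"site-a/index": b"hello", "site-b/news": b"new story", "site-c/home": b"hi"}

index = take_snapshot(february)
to_store, removed = incremental_changes(index, march)
print(sorted(to_store))  # only the changed page and the new page need storing
print(removed)           # URLs present in February but gone in March
```

The unchanged page is not stored a second time, which is why frequent increments can cost far
less than frequent full snapshots.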


In the future, it is likely that new standards will be created to improve the efficiency of 
indexing, making the web crawler’s task easier. As that occurs, the needs of archiving should be 
incorporated. Change notification standards, which automatically note changes in sites, may make 
this task even easier in the future. In fact, standards might one day exist which are directly 
designed to make archiving easier. However, there is no point in waiting until then - a 
very serviceable archive can be captured today. 


When first told of this proposal, people are greatly tempted to suggest "improvements",
most of which I find rather dubious. The most common suggestion is to edit the record and
save only a carefully selected set of material - i.e., that which is "important". The cliche of
separating the "wheat" from the "chaff" is frequently invoked.


But, what is wheat and what is chaff? Who makes the determination of what is important? 
Or, more to the point, who is able to say today what sort of information will be useful to future 
scholars sifting through the archive? 


The lesson from archeology is quite clear - people are rather poor at making this 
determination. Archeologists rely more on trash dumps, refuse pits and toilet areas than they do 
on the "official" record. The things which people explicitly discarded as unimportant at the time
are often the most useful in reconstructing day to day life, or in learning what the typical person
was doing. The elite of any society is interested in recording its actions - through burials, 
temples, monuments and written records. Society as a whole tends to be underrepresented, 
because it lacks the resources to store everything. 


It is hard to second-guess what technology, or what interests people will have in the 
future. The study of AIDS was greatly helped by stored samples of blood. Such a sample 
allowed the first recorded case to be traced to a British sailor who died of a mysterious illness in 
1952. With more samples, particularly in Africa, we would almost certainly know a great deal 
more about the disease and its origin. This case is not atypical - DNA analysis and a host of other
technologies are helping piece together mysteries such as the social structure of Easter Island,
"Otzi" the paleolithic man found frozen in the Alps, or the mummies of Incan children from the
Andes. They certainly had no clue at the time what technology we would be using now. In fact, 
had the remains been discovered a decade or two ago, the scientists involved in the discovery 
might not have guessed that PCR, magnetic resonance imaging, CAT scans and computer analysis 
would become routine tools. 


Future information retrieval software, which is able to parse and understand human 
languages, could revolutionize the study of a large corpus of text, but we will not be able to apply 
such techniques to the historical record unless it is saved. 


The interests of historians are equally hard to guess up front. It is easy to say that 
historians would be interested in the Internet postings done by Presidents of the United States,
Nobel Prize winners, diplomats, writers and the leading intellectuals of our time. Future 
occupants of these lofty positions are on the net today - as children or undiscovered 
adults - making web sites, posting to newsgroups and creating a record that will only be 
interesting in retrospect. The same is true of political movements and organizations, businesses 
and social trends. We cannot know what to save until we know what is important - not important 
now, but in the future. 


The net is a very organic and connected community. A new trend, idea or bit of urban 
folklore can start one place, then rapidly spread. Although any one user may visit a very 
restricted set of sites, the users as a community have a great deal of overlap. Search tools can -
by design or accidentally - expose anybody to anything. There is a short chain of connection
between any site and any user. Any division of the net into "historically important" or "wheat"
sites versus the unnecessary chaff is artificial.


I believe that it is incredibly dangerous to second guess future generations, and edit the 
historical record. We should archive all of the net that we possibly can. Ironically, it is probably 
cheaper and easier to store it all. Digital tape is cheap. Human time to categorize and edit is
expensive by comparison. Leave the editing and selection for future generations - or their
software agents.


Draft of 2/10/97 
Assessing the Need: 1 
What Information and Activities Should We Preserve? 


Michael L. Miller2


Abstract 

Archivists, records managers, librarians, and technology advocates have developed visions of what 
electronic records could and perhaps even should be preserved for various purposes. This paper looks at 
the issue from a very pragmatic point of view of someone who has worked in both a Federal agency that 
produces and disseminates huge quantities of information and an archival program that has attempted to 
deal with the flood of Federal records for roughly a quarter of a century. The author examines some 
questions underlying the title of this paper with the goal of highlighting areas of potential agreement and 
disagreement on how the issue of preservation should be addressed. The author argues that we need to 
look to limit the amount of documentation we seek to retain over time and match our preservation plans 
to our management capabilities at this time.3 


Introduction 

The request to write this paper brings to mind two stories. The first is about myself. When I was in the 
military back in the late 1960s, I bought a (for the time) very elaborate 35mm camera, with multiple 
lenses, filters, external flash and lots of other gadgets. As was typical of cameras of that time, this 
“fancy” camera did not have automatic light metering and focusing. After a few years of taking 
“artistic” photos, I found myself using the camera primarily for snapshots of friends, a task for which it 
was unsuited (in my hands anyway). I eventually bought a much more modest 35mm camera that was 
entirely automatic and fit my needs very well. 


The second story is much older, one of Aesop’s Fables, the story of A Boy and the Filberts.4 


“A boy put his hand into a jar of Filberts and grasped as many as his fist could possibly hold. 
But when he tried to pull it out again, he found he couldn’t do so, for the neck of the jar was too 
small to allow of the passage of so large a handful. Unwilling to lose his nuts, but unable to 
withdraw his hand, he burst into tears. A bystander, who saw where the trouble lay, said to him, 
“Come my boy, don’t be so greedy: be content with half the amount, and you’ll be able to get 
your hand out without difficulty.” 


Do not attempt too much at once.” 


What ties these two together in my mind is the need to think through very clearly exactly what we need 
to preserve and why, when it comes to electronic records. The question in the title of this paper is not 
new - it is at the core of the archival discipline. In its most basic form, the question of what we should 
preserve is one of determining value - value to whom, for what purpose(s) and for how long. Some 
would say it is the most critical question facing archivists, and most would agree that at least it is the 
most fun to debate. The volumes of literature on the subject bear witness to that fact.5 This paper is not
meant to resolve the ongoing debate, but rather to frame the debate and identify some of the critical 
issues from the perspective of someone with both records management and archival backgrounds. 


The question “What Information and Activities Should We Preserve?” seems simple enough, but the 
underlying issues are far more complex. I see three basic sets of questions: 


What is the “what” that we are considering preserving? 
What do we mean by “preserve” in this context? 
Who makes up the “we” that will be doing the preservation? 


The format of this paper will be to look at each of these themes in detail, raising a series of questions 
about the theme and then offering a position from the point of view of this archivist/records manager. 
Hopefully this will serve as the springboard for broader discussions. 


However, before even looking at what we might want to preserve, it is worth examining an underlying 
presupposition - namely that some electronic information is not worth preserving.6 Some have indicated
that the storage capacity of modern computers and the distributed nature of the Web would allow us to 
keep practically everything, making the question this paper addresses irrelevant. Traditionally archivists, 
records managers, and librarians have assumed that some documents would not be preserved. The
question has always been how the documents are selected, by whom, and for what purposes. Do
electronic records resolve the need to be selective?


In my view, selection will be necessary in the future. The only question is how that selection will be 
accomplished. Selection of documentation for preservation can be based on some definition of value - 
for organizational needs, specified purposes, potential or actual use, or other considerations. 
Preservation can also be based around management and cost issues so that older information is preserved 
only as long as it does not require extensive, or in some cases, any additional resources. In many cases, 
preservation is in fact random, especially the preservation of paper records, which can be “preserved” 
because people neglect them by throwing them into an attic or basement where they are discovered 
decades, even centuries later. Fortunately or unfortunately, the last scenario, “preservation through
neglect,” is not an option for electronic records, making some agreement on what to preserve even more
important than it was in the past.


Examining the “What” 

Margaret Hedstrom subtitled one of her many articles on electronic records issues “Deciding What is 
Essential and Imagining What is Possible.”7 The challenge of imagining what is possible and deciding 
what is essential is crucial to all areas of electronic information, but especially when it comes to laying 
out the potential types of activities for documentation. It is here that we need to be sure to get beyond 
Terry Cook’s warning about the problem of using paper minds to evaluate electronic records.8 


The “what” question can be subdivided into two related questions. The first question is the “imagining
what is possible” part and the second is the “deciding what is essential” part.


What are the basic types of documentation that might be preserved and why? 


What qualities and functionality of that documentation warrant preservation and for what 
purposes? 


The question that is not included is what specific activities deserve to be documented. That is a 
deliberate omission. Whether, in the grand scheme of things, AIDS research is more worthy of
documentation than drag racing is not a question to be addressed in a general paper. All activities have
their champions who can explain better than I why the record of their activity deserves to be preserved. 
My goal here is to lay out a framework for deciding what types of documentation to keep to meet your 
needs assuming that you have decided to document the activity. 


The following is a brief list of the types of documentation that should be considered for preservation, at 
least in the “imagining” stage. In some cases the categories overlap, but the goal here is ensuring breadth 
of coverage not taxonomic exactitude. The terminology may smack of archives-centricness, but 
hopefully the descriptions will allow all information professionals9 to recognize the types of
documentation under discussion. These categories are not presented because I think that they all need to
be preserved, but because each has been suggested for preservation at one time or another for certain
reasons.


Final Product Created for Use by External Audiences. This category would include any type of 
electronic product meant for use as an “information resource” by an audience external to the creating 
organization. Such final products may be only in electronic form or may be available in a variety of 
media. Examples include newspapers, books, reports, catalogs, data products such as Census files, final 
processed observational data, digital photographs, data products released as part of scientific 
collaboration, and Web pages. The basic characteristic of all of these items is that they are final,
finished products meant primarily for use by others and not as part of documentation of the creating
organization’s regular day-to-day activities.11 The primary goal of these products is to disseminate
information.


Final Documentation Needed as Evidence of a Transaction, Activity, or Program. This category would 
include final, finished documentation of the organization’s activities and programs. They are 
distinguished from the first category because their primary purpose is to conduct a “business transaction” 
not simply to provide information to an external audience (although they may do that as well). The
documentation may consist of a single document, a series of documents (file), or an entire recordkeeping 
system. Examples would include letters to constituents, a will or deed, an international treaty, bills for 
services, an electronic commerce interchange, a court decision, the records of a court case, or the 
administrative record documenting the development of a government regulation. The key characteristics 
of such documents are that they are used primarily to accomplish a business purpose and are relied upon by
the organization as the documentation of its activity. Generally speaking, the organization or parties
involved determine what constitutes the “official record” of the transaction based on law, regulation, or 
best practices. Once created, the record needs to be retained primarily as evidence that the transaction 
took place and was properly carried out. 


Although these two products differ by purpose and intended audience they share a number of 
characteristics. First, they are the “official positions” or statements of the creating organization. This 
makes them useful for some types of inquiries (e.g., what was the Supreme Court decision in a specific 
case), but less useful for others (e.g., how did the court really arrive at the decision). Second, as the 
official positions of the organizations, they may well deserve to be preserved, even after they have been 
superseded. Third, both types of products sit atop a pyramid of other documentation that supports them. 


Administrative Data. This category consists of data created and collected by an organization to carry out its
functions and activities. Such data will serve as a major information resource for the organization and will also be
used for analytical purposes and to develop products discussed above. However, despite the variety of
uses to which they are put, all of these databases have an ongoing organizational functionality that
requires that they be maintained over time to meet the business needs of the organization. Examples
include budget, personnel, payroll, and other administrative systems; systems to track accounts; systems
to track and manage programmatic functions such as the issuance of permits; systems to track the
issuance of passports; and other types of systems that document how a specific organizational activity is
managed. In their active life these systems support the work of the creating organizations, but they may
be of interest to those researching specific areas documented by the system.


Software. This category includes off-the-shelf commercial software, commercial software as modified 
by the organization to meet organizational needs, custom designed software, expert systems, computer 
models, and other software used by the organization. A record of software has two purposes. First, it 
allows information that is in a software dependent format to be used. Second, a knowledge of software 
used by an organization and the capabilities and limitations of the software is critical to understanding 
the abilities and inabilities of users to accomplish certain tasks. Some examples may help show why
information about software and how it works matters. First, it has been the policy of the Environmental Protection
Agency (EPA) and other Federal agencies to print to paper any e-mail message qualifying as a Federal
record. To properly evaluate the resulting paper copies, it is necessary to know that the system printed 
the date the message was originally created, not the date it was sent, even when the two dates were 
different. At a higher level, a copy of the expert system used by EPA permit writers to develop permit 
documentation will reveal a lot about the process used to issue permits and why some applications would 
be denied. Finally a record of the changes to the massive modeling system used to produce the national 
economic models or to the system used to interpolate data in Census studies will be crucial to
understanding the validity of the data.


Explanatory Material Needed to Understand the Final Products. Electronic documents, information, or
objects frequently are not self-explanatory. To use or understand them generally requires additional
information that varies by the type of electronic record under discussion. For databases, one needs
information about data elements and codes for example, while metadata about how and why a record was
created is fundamental to understanding its role in the activity it documents. To be truly useful, scientific
data needs contextual information about its accuracy, the instrumentation used to collect it, and data
limitations. The bottom line is that to fully understand any document (electronic or otherwise) there must be
sufficient explanatory material retained with it.


Source Materials. This category includes those materials that were mined to produce a final product. It
may include raw data that is later corrected, edited, and enhanced or finalized; databases from which
subsets are extracted; geographic information system (GIS) data layers used to create specific products;
a data warehouse used to conduct an analysis; historical data used for projections; and data used in
computer modeling.12


Product Development Materials. This category includes the documentation created during the 
development of a final product, but not required or judged to be part of the final product or official 
documentation. Examples include notes, working files, drafts, records of receipt of electronic mail 
messages, and other documents created during the business process. In the case of scientific data, it 
might include the results of experiments, measurements, or testing, not required to support publication of 
the results. In socio-economic analysis it might include data analyses, the results of computer modeling, 
and the data used to test hypotheses. In land use studies, it might be the results of GIS runs used to 
identify potential sites for a waste facility. Although these materials were judged not to be necessary for
the final documentation of the activity in question, they were consciously created as part of the
development process, and may provide valuable information about how a specific activity was carried
out.


Data Management Information. This category covers the documentation created by the computer system 
or application independent of the will of the individual user. In document creation it would include the 
basic information created on document size, initial creation date, the number of editing sessions, a record 
of changes made to the document during the session, and other information captured automatically. In 
developing a GIS product it is the record of how the actual processing took place. In the case of EDI 
transmissions it is the logs of the various translations, transfers, receipts, and other stages in the 
transmission and processing of the EDI transaction. Relating to the Web, it includes such diverse records 
as the record of the session and the sites visited, the records of what information was available on the 
Web at a particular time, and information captured about those visiting the Web site. In database 
applications it includes audit trails, records of who accessed the system, and changes to the data. 

Systems management information would include information about systems users, passwords, log-ons, 
and other information used to manage the system itself. 


Tying all of these different types of information together is the fact that they were created to aid in the 
management of computer systems and the reconstruction of data in case of system failure. However, all 
of them can be used to document various aspects of an individual user’s activities, and may, in some 
cases, actually shed light on how a transaction occurred or an activity was completed. As an example, 
proving the integrity of a recordkeeping system might involve keeping a record of access to the system, 
the passwords and PINs of persons having access, and any changes made to the data. 


Informal Electronic Communications. Much e-mail traffic consists of sharing of information and views, 
exploring areas of common interest, and personal communications of all sorts. For many, e-mail has 
replaced both the telephone and the letter as the preferred way to communicate with peers, family, and 
friends. One need only think of the variety of listservs, chat rooms, personal group lists, and informal 
colleague-to-colleague communications to understand that there is a wealth of communication going on 
that may be business related that is not linked to a specific “business transaction” or “product.” Even so, 
these communications may provide tremendous insight into the workings and cultures of many segments 
of our society. The informal and “private” nature of the medium means that many people are far freer in 
expressing themselves over e-mail than they are in other media. Although there are a series of issues 
relating to the privacy of persons involved in such exchanges, those who want to capture the flavor of the 
late 20th century will find them a wealth of information about how we live and work.13 


What Does It Mean to Preserve? 

Preservation of electronic information has had a checkered history to this point. Those trying to preserve 
information electronically have had to deal with evolving hardware and software, fragile media, and the 
need to maintain not only the information itself, but the documentation and software necessary to support 
it. Whether the preservation is done by data libraries, data centers, or archives, it has presented 
problems. It is this history of limited success in maintaining electronic records that serves as the context 
for my remarks. While the future is brighter in this area, the present offers significant challenges that 
make me wary of attempting too much. 


It is my belief that the best documentation is the result of a clearly bounded understanding of the 
entity (e.g., organization, event, city, phenomenon, person, etc.) to be documented and a firm vision of 
what it is that is to be documented, and the level of detail in documentation that is sufficient. In 
government archives, the current formulation of this theory is known as functional appraisal, where the 
archivist looks at the major functional activities of an agency (e.g., issuing permits for pesticide uses) 
and determines what documentation should be preserved to document that activity. The shorthand 
version of the theory is that the archivist has a better chance of preserving a useful record for the future if 
s/he simply selects the documentation of what the organization does that is important in its view, rather 
than trying to determine what records might be useful in the future based on the types of information 
they contain. There are exceptions, but this is the general rule.14 


For example, the Environmental Protection Agency (EPA) creates extensive files concerning each 
pesticide, its formulation, product testing, potential risks, and other aspects that are taken into 
consideration in reviewing the application. They were appraised based on the documentation they 
provide of how the Agency carried out an extremely sensitive function, as the strawberry growers in 
California know full well. However, historians of agriculture, science, labor, and issues that we haven’t 
even thought of yet, may find extensive information about their subjects in these files. 


This is not always an easy task. The National Archives and Records Administration (NARA) has 
developed its bounded area of concern - records of Federal agencies that constitute essential evidence. 
What that means in concrete terms is being investigated now.15 This does however raise another bias of 
mine, which is I don’t believe in saving everything. It can be argued that every type of documentation 
might be useful in resolving some issue. In my mind, that does not justify keeping everything and letting 
technology allow us to store it and search it. Storage is not the major issue - but information 
management, ongoing access in the midst of changing technologies, and ability to comprehend what has 
been stored are very real issues. Therefore I would argue that we need a process to select what we will 
retain. 


The first step in understanding the preservation issue is to look at why we might seek to preserve 
information. I offer a few of the many potential uses of the information we preserve, arranged into what 
I perceive as levels of system functionality to serve those needs. Level I is the simplest and Level III the 
most difficult. 


Level I: Provide information 
    Provide answers to particular questions. 
    Provide raw material for research. 
    Provide a historical “record” of what took place (corporate memory). 

Level II: Serve as the official, legal record of activities 
    Support the ongoing operation of the business. 
    Serve as the legal record of the transaction or a portion of it. 
    Permit oversight and support accountability. 

Level III: Allow for the recreation of the transaction to check validity of results 


I see each of these three levels as being progressively more difficult and expensive to maintain. I would 
also argue that, as time goes on, the primary uses of the information gradually move to Level I uses, and 
the number of users that have an interest in Level II and III uses will decline. Therefore we should be 
thinking in terms of a strategy that gradually reduces the functionality of the systems as well as the 
amount of information we preserve to keep them in line with research needs. 


This viewpoint is not universally accepted. A major reason is that it limits accountability and oversight. 
To the extent that we think in terms of official accountability for activities, a “statute of limitations” can 
be applied that limits the length of time that records need to be maintained for accountability purposes. 
This “predictability” about the uses of the records is essential to their management. For example, the 
EPA limits the length of time it can require the regulated community to produce records required under 
regulations. This gives the regulated companies the ability to manage their records. Yet those 
investigating wrongdoing - corporate knowledge of health risks from smoking, for example - always want 
as much material maintained as possible forever, or until all of the issues have been resolved. This 
perspective finds all information potentially useful and would require the retention of all information for 
extended periods of time to ensure that it will be available “just in case.”16 This desire runs counter to 
the need to rationally manage one’s records and information resources. 


If the final product sits upon a pyramid of supporting documentation, and the final product is an obvious 
candidate for preservation, the question becomes what supporting documentation (if any) should be 
maintained, for what period of time, in what format, and for what purpose. Each of these four aspects 
deserves detailed examination. Let me say at the outset that I would argue that there are instances when 
each of these documentation sources is worth preserving at least from a theoretical point of view. I 
would also argue that, in the vast majority of cases, the final products of the activities will serve the 
majority of research uses very well, especially as time goes on. The basic issue is to develop guidelines 
as to what the minimum level of necessary documentation might be. Those whose job it is to preserve 
information and/or records should not be deluded into thinking that what constitutes appropriate and 
indeed necessary documentation in some special cases should become the standard for documenting all 
activities. 


The first question then is the level of richness in documentation that we wish to maintain. The most 
basic question concerning electronic records and the level of richness to be preserved, is whether we see 
electronic records as a way of preserving more information about more activities or whether we see them as 
a way to preserve the same level of documentation in a format that is easier to access and use. The more 
we wish to preserve electronically the greater the management problems. 


Technically, the maintenance of final products is relatively easy, compared with the maintenance of 
supporting materials. The document creator specifically wanted to see the final products preserved, at 
least for a time, so these items are better documented and managed than the materials used to create 
them. Supporting documents may provide crucial information about the evolution of a document, for 
example, or they may be a jumble of information that is difficult if not impossible to understand as 
context. The maintenance of supporting documentation works best when the period of preservation is 
relatively short and the scope of the attempt is relatively narrow. At this point in the evolution of 
electronic records our ability to identify and link all the supporting materials for a specific activity is 
limited by two practical problems. Programs wishing to preserve supporting documents on a 
regular basis will have to deal with software that is geared toward allowing maximum capability to 
change and update documents and information, and records creators who are not used to thinking of the 
preliminary documents as records that need to be managed. 


The second question concerns the length of time the documentation should be preserved. Archivists 
used to speak of “permanent” records, but now speak of records of continuing value. In effect 
preservation begins at the moment of creation and it is simply a question of how long specific 
information or documents need to be retained - how long do they have continuing value?17 Preservation 
becomes an ongoing process where the level of documentation will gradually diminish over time based 
on need. Viewed this way, it is necessary to lay out clearly, for specific activities and classes of 
information, exactly what is to be preserved and for how long. It is quite possible to agree that for some 
activities, relatively full documentation should be preserved for a specific period of time - several years 
perhaps - with the idea that unless there is some specific issue, a major investigation or problem with an 
audit, or a newspaper story, the documentation will be reduced in complexity after a specific period of 
time.18 


The third question concerns the level of functionality that should be retained over time. In the case of 
word processing documents, must we preserve the capability to modify those documents, a capability 
(functionality) that the creator would have had? If we need to retain records of system audits until the 
transactions of a system have been audited, do we need to maintain them after that, or is a simple record 
of the findings of the audit sufficient? For example, it is possible to argue that for a period of time, 
researchers should be able to easily replicate the analysis done on a GIS in support of the siting of a 
waste-water treatment plant. However, preserving the complete functionality of the system 
rather than simply the results of the analysis can be difficult, especially if it must be maintained across 
system migrations. This is why the idea of gradually diminishing levels of documentation has always 
made sense. Yes, there would be full functionality to recreate the analysis for a while, because the 
decision may be in doubt for a while or subject to litigation. But given the costs of maintaining 
that functionality, it would be allowed to diminish over time as the need to document what actually took 
place and why superseded the need to be able to second-guess decision makers. 


Another commonly debated question of functionality is how we go about preserving documents. At one 
end of the scale we have ASCII text, which is easy to preserve, but does not capture the look and 
structure of the documents. A second strategy might be to use SGML as a standard and convert 
documents to that (or HTML) for preservation. A third option would be to preserve the documents in a 
read-only format using a viewer, or to image the documents to preserve their format. A fourth is to ask 
whether electronic media are the best way to preserve records or outdated information over time. For 
documents that are not active and are difficult to maintain (older expert systems for example) does it ever 
make sense to preserve what is needed in paper form? Perhaps only the manuals explaining how the 
system was used would be sufficient. Even if there was a need to maintain electronic access, the manuals 
would be easier to maintain electronically than the systems themselves. Are all of these approaches viable 
options, depending on the needs of the users? Is it better to leave everyone to develop their own strategy 
or to try to set guidelines on when each is most applicable? 


Finally, we return to the question of purpose - why are we retaining these records? To the extent that we 
are attempting to preserve information that people can use to answer questions, there are a number of 
alternatives that would simplify preservation. A creating organization may have to retain extensive 
“overhead” to prove that the electronic records on their system are indeed the authentic records of the 
organization and are trustworthy in their content. But once the records have served their business 
function, is it still necessary to preserve the full “recordkeeping functionality” of the system? As an 
example, members of the regulated community producing reformulated gasoline can submit their 
mandatory reports via electronic data interchange (EDI). These submissions can be used for civil and 
criminal prosecution so the ability to manage them as official records and ensure their complete accuracy 
is very important. However, the EPA can only initiate criminal and civil penalties for a specific period 
of time - generally seven years. After that, the submissions still have value to those studying the 
reformulated gas industry and developing new regulations, but they cannot be used for prosecution. 
After the seven years, it makes sense to remove them from the recordkeeping system, and maintain them 
in a data library with proper security to ensure that the data were not changed. 


To sum up, preservation should be seen as an ongoing process, not only in terms of the need to perform it 
on an ongoing basis, but also because it requires constant reevaluation of what level of richness of 
information and functionality we need to preserve. It becomes a question of resources. The simpler the 
materials we choose to preserve and the more standard the formats, the easier our jobs will be. Given 
limited resources, we need to develop a clear set of priorities as to what actually should be preserved. 
These priorities should be based on predictable use of the records and a clear understanding of their 
value to particular constituencies for particular purposes. 


Who is the “We” That Will Do the Preservation? 

This question is the subject of a later paper, but to explain my approach to the preservation question, I 
need to at least pose my position on this matter. The “we” should be considered extremely broadly at 
least in the first instance. It should include records creators, data and traditional libraries, and archives of 
all descriptions. This broad-brush approach has two major advantages and one major disadvantage. The 
first advantage is a philosophical one. A diverse group of preservationists with differing interests and 
values for documentation will result in diverse things being saved, at least initially. A preservation 
program with a well-defined focus and clear boundaries will have a better chance of preserving the 
richness of events than an institution with an extremely broad preservation mandate. 


The second is a practical one. If the program preserving documentation in electronic form can focus on a 
specific defined area, it has a better chance to resolve the technical issues inherent in any electronic 
records program. An organization that has a single e-mail system can devise a system-specific means of 
preserving the messages from that system more easily than NARA can devise a system for preserving the 
products of multiple systems. 


The rather obvious disadvantage is that there is no guarantee that many of the “preservationist” programs 
will last themselves. As a result documentation may well be lost. This is certainly true, but the final 
result may not be any different than if preservation were left to a small select group of programs. If 
anything, the approach buys time, which in preservation is a major advantage. Theoretically, additional 
time may help sort out what documentation is actually necessary. My own view is that preservation, 
even archival preservation, is for the length of time the documentation continues to have value. 
Therefore I assume that the value of certain types of documentation currently preserved will diminish 
over time, resulting in certain records being discarded in any case.19 From a technical perspective, the 
additional time allows for new approaches for preserving electronic records to evolve. 


However, we must be aware that records creators are usually not in the preservation business. They have 
their own work to accomplish and the records and information they generate are a by-product of that 
activity. They will preserve the records and information as long as doing so serves their purposes and the 
preservation does not entail significant costs. However, creating groups may balk at the expense of 
migrating older records no longer used to a new hardware or software platform. Records creators have a 
difficult time understanding why it is necessary to preserve a record of the limited functionality of an old 
information system when they migrate to a new one. In sum, records creators may help for the early 
stages of the preservation process; but long-term preservation has been and will remain the work of 
specialists in preservation. 


Conclusion 

This paper has attempted to lay out some of the major choices we face in determining what information 
and activities should be preserved to document the digital age. I am drawn back to the boy in Aesop’s 
fable who had to give up some of the filberts that were in his grasp in order to get any at all. I find that 
many of us are enamored of what the technology can do - if applied as the vendors describe. However, 
many parts of our culture are far behind the cutting edge of technology. Partly it is because records 
creators do not choose to implement many of the options available to them, because they don’t see the 
need. Even when they do, the effectiveness of the technology solutions still stands or falls on the 
capability and dedication of the people implementing them. As only one example, it is clear that records 
creators will try to “outwit” records management software by entering insufficient indexing and metadata 
if they do not see the need for it or do not clearly understand how the system works. 


The parameters of the debate are bounded on one side by the ever-expanding potential of the technology 
and on the other by the much more slowly expanding capabilities of the people employing the 
technologies. The question comes down to not what should we preserve, but what can we feasibly 
manage at this point. Manage means two things. First is to physically preserve the information and 
records in a technical sense, ensuring that it is accessible despite the passage of time and changing 
technology. This means an ongoing program of preservation. But management also means that the 
information and documentation has the context to make it understandable as well as accessible. Today 
much of the context for electronic records is provided by the individuals who created them and worked 
with them. We will need to get beyond that limitation if we are going to be successful in preserving 
electronic records over the long haul. 


Endnotes 


1 This paper was prepared for the conference “Documenting the Digital Age,” February 12-15, 1997 and was 
submitted for pre-conference review on January 21, 1997. Due to the short lead time to prepare the paper for 
review, the author will make a second version available at the conference. 


2 Michael L. Miller became the Program Director for Records Management at the National Archives and Records 
Administration on January 6, 1997. From 1990 through 1996, he served as the National Program Manager for 
Records and Agency Records Officer for the Environmental Protection Agency. From 1976 to 1990 he was an archivist 
with several units in the National Archives including the [then] Machine-readable Records Branch and the Records 
Appraisal and Disposition Division. He is a frequent speaker on records management, appraisal, and electronic 
records issues, and has published a number of articles in the field. Mike received a Ph.D. in history from the Ohio 
State University in 1980. 

3 The views expressed in this paper are those of the author and should not be construed to reflect the official 
positions of either his current or former employers. 

4 V.S. Vernon Jones, trans., Aesop’s Fables (New York: Avenel Books, 1912), p. 61. 




5 The bibliography on the subject of appraisal of electronic records is immense. It is probably the most written 
about aspect of electronic records. For some recent bibliography see Richard Cox, “Readings in Archives and 
Electronic Records: Annotated Bibliography and Analysis of the Literature,” in Archives and Museum Informatics 
Technical Report No. 18 (Pittsburgh: Archives and Museum Informatics, 1993), pp. 99-156; Luciana Duranti, 
“The Concept of Appraisal and Archival Theory,” American Archivist 57/2 (Spring 1994), pp. 328-44; and Terry 
Eastwood, “How Goes It with Appraisal?” Archivaria 36 (Autumn 1993), pp. 111-121. 

6 The title of this paper uses the term “information” and I will use that term and the term documentation to include 
publications, databases, documents, information systems, and records. If I am discussing a specific type of 
information, e.g., records, I will use the more specific term. 

7 Margaret Hedstrom, “Descriptive Practices for Electronic Records: Deciding What is Essential and Imagining 
What is Possible,” Archivaria 36 (Autumn 1993), pp. 53-63. 

8 Citation to Terry Cook’s article. 


9 The term “information professionals” is meant to be inclusive - to include librarians, systems managers, archivists, 
records managers and others who have an interest in the preservation of electronic information. 

10 The discussion also does not mean to indicate whether I consider these to be “records” from an archival point of 
view, merely that they exist and have some perceived usefulness. I will not attempt a definition of an “electronic 
record” here. This issue is well discussed in David Bearman, Electronic Evidence (Pittsburgh: Archives and 
Museum Informatics, 1994). I do want to make clear that I consider electronic records to be a small subset of all 
electronic information and subject to more rigid control and management. See Richard Cox, ed., University of 
Pittsburgh Recordkeeping Functional Requirements Project: Reports and Working Papers -- Progress Report Two 
(Pittsburgh: University of Pittsburgh School of Library and Information Science, 1995). 

11 I use the term “creating organization” to mean the entity responsible for creating the information product. That 
entity may be a person, organization, business, etc. 

12 In some cases, the same data may appear in one place as a final product and in another as source materials, as in 
the case of demographic data provided by the Census Bureau as a final product which is used by a real estate firm as 
source data in its GIS to study housing patterns. 

13 Seminal work on the use of e-mail in research was published by Avra Michelson and Jeff Rothenberg, 
“Scholarly Communication and Information Technology: Exploring the Impact of Changes in Research Process on 
Archives,” American Archivist 55 (Spring 1992): 236-315. On the potential research uses of electronic 
communications in a university environment see Anne J. Gilliland-Swetland and Carol Hughes, “Enhancing 
Archival Description for Public Computer Conference of Historical Value: An Exploratory Study,” American 
Archivist 55 (Spring 1992), pp. 316-330. 

14 A good overview of the current discussion is found in Angelika Menne-Haritz, “Appraisal or Documentation: 
Can We Appraise Archives by Selecting Content?” in American Archivist 57/3 (Summer 1994), pp. 528-542. 

15 This same question lies behind the term “essential evidence,” a formulation for value used in the National 
Archives and Records Administration’s strategic plan. The question is “essential to whom.” 

16 A fallacy in this argument is that many of the smoking guns that are found are the result of information not 
being properly managed. In many cases, the records survived due to stupidity on the part of the creator, or because 
the records were ignored for many years. Electronic records will not survive due to inattention over time. 

17 This follows the line of thinking generally known as the “records continuum.” For a recent overview of the 
idea see Jay Atherton, “Corporate Memory and the Records Continuum,” unpublished paper presented at the 1996 
Annual Meeting of the Society of American Archivists. 

18 This type of reasoning is behind the traditional records management practice of dividing the documentation of 
an activity into the “official file” which would have longer retention, and the “working file” that would have a 
shorter retention. 

19 See generally, Terry Eastwood, “How Goes It with Appraisal?” Archivaria 36 (Autumn 1993), pp. 111-121. 




How Do We Archive Digital Records?: 
The Report of the CPA/RLG Task Force 


by 


Donald J. Waters 


Associate University Librarian 
Yale University 


January 27, 1997 


How Do We Archive Digital Records?: 
The Report of the CPA/RLG Task Force 


The Task Force on Archiving of Digital Information 


Rapid changes in the means of recording information, in formats for storage, in operating 
systems, and in application technologies threaten to make the life of information in the digital age 
much like life in Hobbes’ state of nature: “nasty, brutish, and short.” The Commission on 
Preservation and Access and the Research Libraries Group (RLG) created the Task Force on 
Archiving of Digital Information at the end of 1994 to help relieve building anxiety about the 
fragility of culturally significant digital information. The Commission and RLG asked the Task 
Force to frame digital archiving as a set of problems and tasks and to suggest an orderly, perhaps 
even manageable, approach to their resolution. 


The Commission and RLG selected members with a breadth of experience from a broad 
range of disciplines and backgrounds, including many from the research library community. I am 
sure that it was an accident, but as if to emphasize the strangeness of the new land they were 
asking the group to chart, the Commission and RLG selected two co-chairs -- me and John Garrett 
-- who are both anthropologists by training. In addition to research librarians and anthropologists, 
the Task Force included archivists, publishers, technologists, bibliographic service vendors, and 
legal and copyright specialists. The Task Force sponsors then asked the group to seek input from 
a still wider array of specialists and interested parties by issuing a draft report, distributing it 
widely, and inviting comment before composing a final report. 


The Task Force submitted its draft report in August 1995. The comment period formally 
ended on October 31, but in fact continued through January 1996. We received numerous 
thoughtful and helpful comments, suggestions and criticisms from many individuals. The 
international interest in the report was especially gratifying for those of us on the Task Force. We 
received extensive comments from a federation of libraries and archival agencies in Australia, from 
the Library networks and services Directorate of the Commission of the European Union and in 
January from the Consortium of University Research Libraries in Great Britain. 


The Task Force incorporated what we learned into our final report. We corrected the most 
flagrant errors and infelicities contained in the draft report and, in revisions and an extensive set of 
annotations, addressed most of the questions and additional issues that arose during the comment 
period. We completed our work and submitted our final report on May 1, 1996. 


The first of nine recommendations that the Task Force made in its final report called for “a 
cooperative project designed to place information objects from the early digital age into trust for use 
by future generations.” We argued that “action is urgently needed to ensure that documents, 
software products and other digital information objects that document the early digital age from 
1945 to 1990 are preserved before they slip irrevocably away” (Task Force 1996:38). The theme 
of this meeting -- “Documenting the Digital Age” -- clearly overlaps and advances the interests of 
the Task Force and I am privileged as a member of that group to be a part of this discussion. 


As my contribution to this discussion, I want to stimulate your attention to the question of 
the means and prospects of digital archiving, to the question of how do we archive digital records? 
My central argument, following the Task Force report, is that “the problem of preserving digital 
information for the future is not only -- or even primarily -- a problem of fine tuning a narrow set 
of technical variables....Rather, it is a problem of organizing ourselves over time and as a society 
to maneuver effectively in a digital landscape. It is a problem of building -- almost from scratch -- 
the various systematic supports, or deep infrastructure, that will enable us to tame our anxieties and 
move our cultural records naturally and confidently into the future” (ibid.: 6). 


To develop our understanding of what is involved in building such a deep infrastructure for 
digital archiving, I ask you please to join me in thinking through the following chain of reasoning. 
We need first to accept the premise that archiving is central to the emerging knowledge-based 
economy. Without economy in the archiving of digital information there can be no real economies 
in the production and distribution of knowledge. Second, I want to suggest that economies in 
archiving depend on our understanding of the integrity of digital information objects and arise from 
the organizational requirements for preserving the integrity of those objects. And finally, I want to 
persuade you that the path to achieving a knowledge-based economy is actually to set the 
[processes] of digital archiving in motion as a pervasive and trusted foundation for cultural 
discourse. 


The Value of Archiving in a Knowledge-based Economy 


Any discourse about economy, about the efficient management of scarce resources toward 
valued ends, is ultimately a discourse about values. Of what value or good, we must ask, is 
archiving and why should we push any scarce resources its way? This is a difficult question about 
purpose that may immediately open questions about and prompt defenses of particular forms of 
organization for archiving. In considering the answer, however, we must separate issues of 
purpose and function from those of organization. 


I note in passing here that the Task Force simplified matters greatly in the interest of clarity. 
It consistently equated long-term preservation with archiving, and its report identifies digital 
archives, rather than digital libraries, as the unit of activity for the long-term preservation of digital 
materials. I maintain this usage here and it is a functional, not an organizational distinction. We all 
know that many libraries do frequently assume responsibility for the long-term preservation of the 
record of knowledge, but we have come to designate those that exercise such responsibility as a 
matter of course with special semantic markers as in the phrase “research library.” Moreover, 
although we now refer to “digital libraries,” discussion of such entities to date has made almost no 
reference to the long term value of their content nor to the mechanisms that might be employed to 
preserve such value over time. Rather than use the semantically marked phrase that Peter Graham 
(1995) has suggested, namely the “digital research library,” we adopted the simpler designation of 
“digital archive.” 


In answer to the question about the value of archiving, the Task Force report invokes a 
general “culture-at-risk” argument. Culture -- any culture, so the argument goes -- depends on the 
quality of its record of knowledge. If that record is defective, as it will be if urgent attention is not 
given widely to the preservation of information in digital form, then the quality of the culture is 
also at risk (Task Force 1996: 1-3). The Task Force called attention to the loss of records from the 
1960 census, which is a constitutionally-mandated activity, and highlighted other losses of 
culturally significant records. The Task Force intended its “culture at risk” argument to establish a 
case for the preservation of digital information as a general matter of public interest and policy. 


However, the culture-at-risk argument is a common one, perhaps too common, and is often 
invoked to attract attention -- and public money or philanthropy -- to a wide variety of issues that 
otherwise do not carry much economic or political weight. Is there any special force to the 
argument in this case? There is, I believe, and it emanates from a unique set of factors that are 
contributing to the emergence and development before our eyes of a broadly-based and powerful 
knowledge economy. To elucidate these factors, we must identify the principles underlying a 


Waters, 1/27/97 




knowledge economy as distinct from other kinds of economy and demonstrate the place of
archiving among them. The basic principle that enables us to regard the knowledge economy as a
distinct construct is the notion that the pursuit of knowledge is its own end. As I craft for your
review an analysis of the special force of the culture-at-risk argument based on this
principle, I turn for help to two unlikely sources: the works of Richard Lanham, a distinguished
professor of English at UCLA, and of Jaroslav Pelikan, the great religious historian at Yale.


In The Electronic Word, among other recent publications, Richard Lanham (1993) has 
argued that the scarce commodity in a knowledge-based economy is not information. We are 
glutted with information. Rather, the scarce commodity is the human attention which gives 
information its structure, its usefulness and its value as knowledge. In Lanham’s scheme, human 
attention is labor, information technology is the means by which the labor is applied, and attention- 
structures designed to capture the interest of consumers, including students and other scholars, are 
the products of the labor and the technology. 


At its core, Lanham’s theory is an application of the labor theory of value to the knowledge 
economy. His unique contribution to the theory is his further argument that the discipline of 
rhetoric provides the theoretical framework for systematically describing and evaluating the end
products of this knowledge work, the attention-structures. Note, however, the distinctive quality 
that Lanham attributes to the knowledge economy: as attention-structures, or works of knowledge, 
capture attention, they beget further attention-structures. Knowledge begets knowledge; 
knowledge is its own end. 


In The Idea of the University: A Reexamination, Pelikan (1992) has produced one of the 
most eloquent and detailed critiques of the principle of knowledge as its own end and of the 
university, in which the principle has long provided the central operating concept. According to 
Pelikan, the principle of knowledge as its own end is merely one of a more comprehensive set of 
first principles that he calls the “intellectual virtues.” These virtues are essential for the 
development of knowledge, and include principles of free inquiry and intellectual honesty, an 
obligation to convey the results of research, and an affirmation of the continuity of the intellectual 
life, upon which each generation builds and to which it contributes in turn (ibid.: 32-56). Building 
on this set of first principles, Pelikan argues that the advancement of knowledge through research, 
the transmission of knowledge through teaching, the diffusion of knowledge through publishing, 
and the preservation of knowledge in scholarly collections are the four legs supporting any table 
made for the pursuit of knowledge; they particularly support the table that has come to be known as 
the research university (ibid.: 16-17, 78-133).


Invoking the 19th century phrasing of John Henry Newman, Pelikan goes on to suggest 
that support for teaching, research and publication constitutes the “endowment of living [genius]” 
while efforts to preserve, or archive, knowledge by organizations like libraries, museums and 
archives, represent “the embalming of dead genius” (ibid.: 110). Lest the connotations of these 
archaic phrases give you pause, note that Pelikan is careful to distinguish embalming from 
entombing and his use of “embalming” is a colorful synonym for preservation and archiving which 
he takes to include all of the means necessary to make knowledge accessible to present and future 
generations. Moreover, he vigorously argues that “new knowledge has repeatedly come through 
confronting the old, in the process of which both old and new have been transformed” (ibid.: 120).
Memory is not a warehouse, but an active process of re-categorizing based on previous 
categorizations. In the province of the knowledge economy that we know as the research 
university, the two motives at work -- embalming and endowment of genius, the looking backward 
in preservation and the looking forward in research, teaching and publication -- thus are 
inextricably linked and flow from the principle that the pursuit of knowledge is its own end: 
preserved work from past generations is a necessary foundation for present and future work,
which in turn defines the accessibility of the preserved work.




If we accept the argument that the emerging knowledge economy is founded on the
principle of knowledge as its own end and that the broadly defined function of preserving, or 
archiving, the record of knowledge is essential to the pursuit of knowledge, then it follows that the 
emerging knowledge economy cannot survive without a provision for the archiving function. It is 
this logic that led the Task Force to assert as a fundamental principle that “information
creators/providers/owners have initial responsibility for archiving their digital information objects”
(Task Force 1996: 20). In a knowledge economy, where knowledge is both the source and 
outcome of labor, we presume archiving to be in the producers’ own self-interest. 


Did we on the Task Force believe that such self-interest is sufficient in all domains to meet 
the requirements of a larger knowledge-based culture? Not at all. The knowledge economy is in 
very early stages of development and such a virtuous outcome is by no means secure. 


The developing nature of the knowledge economy is palpable. We all feel it and we do, in 
part, because one of its fundamental characteristics is the force of democratizing the value of 
knowledge as its own end. Lanham marshals considerable evidence that the rapid expansion of the 
division of labor around digital technologies -- what George Gilder (1995; see also Bronson 
[1996]) calls the technologies of sand (for silicon chips), glass (for optical networks) and air (for 
wireless networks) -- has democratized the cultures it touches. The products of the knowledge 
economy -- the attention-structures -- are easier to generate and to use. They make knowledge 
more accessible. The markets for them continue to expand demanding more knowledge workers 
and creating more knowledge consumers who are broadly educated in the arts and sciences. 


In the US and elsewhere abroad, the pressure of these developments on the educational 
system, particularly the system of higher education, has been extraordinary and, at least, fourfold. 
First, the system must serve a growing number of students who by conventional standards need 
remedial training to advance through the curriculum. This pressure is, in part, an expression of the 
distinction between the haves and the have-nots. Second, the broader range of constituents in 
higher education, whether they need remedial training or not, presses for different approaches to 
the curriculum. The expression of this pressure appears, in part, in the form of debates over the 
place of multiculturalism on our campuses and in our curricula. Third, the division of labor in the 
knowledge economy has resulted in both increasing specialization within disciplines and the rapid 
growth of interdisciplinary study. And fourth, the system can only serve the broader range of 
constituents and interests by dramatically lowering the costs of education to affordable levels. 


Note, as Lanham does, that the dynamic described here represents a profound impulse to 
achieve the preeminent goal of education in a democracy (Lanham 1993: 23). That is, insofar as 
knowledge is both the source and outcome of human labor in this rapidly growing segment of the 
economy, literate citizens will prevail who value the lifelong pursuit of knowledge as its own end, 
as both the source and outcome of their labor. Yet, ironically, just as this democratic goal has 
come into plain view, what Donald Norman (1988: 1-33) calls the “psychopathology of everyday 
things” intrudes. Despite the declining costs of information technologies, long-promised 
productivity gains remain elusive, especially in higher education. Absent such gains, the 
cumulative result of the social and economic pressures to lower the cost of education feels to many 
of us in the business as if we are under siege and being asked simply to lower the quality of our
products and services.


Following Norman, I submit that the solution to the productivity paradox in the knowledge 
economy is a matter of design and development. Because digital information is not only a product 
but is also a source of knowledge, it will remain difficult and costly to use as long as its design 
makes it difficult or costly to maintain use, especially over the long term. Economy in the use of 
digital information, in other words, requires an economy of digital archiving. 




Developing an economy for digital archiving 


If preservation is an essential feature of the knowledge economy, then real economies are 
necessary and must emerge in digital archiving for the knowledge economy truly to flourish.
Posed in this way as a problem of economic development, those of us in the business of managing 
the record of knowledge for posterity can easily succumb to terror in the face of the explosion of 
digital information. I identify my own feelings with those of the woman so clearly captured in one 
of James Thurber’s wonderful cartoons. She sits before a bow-tied, pince-nezed, and long-eared 
doctor to whom she has come for help. He looks just like a rabbit and he observes: “You said a 
moment ago that everyone you look at seems to be a rabbit. Now just what do you mean by that, 
Mrs. Sprague?” (Grauer 1995: 148). Everywhere we look, there is digital information. How do 
we put ourselves as a culture in the position of identifying and giving sufficient attention to the 
digital material that is worth saving? 


Although the task overall may be daunting, we are not helpless and without places to start. 
Observe, for example, that the real intellectual action for at least a subset of scholarly disciplines no 
longer even occurs in the conventional publication stream but elsewhere: in on-line databases, on- 
line exchanges of pre-prints, listservs and so on. Conventional publication in these disciplines
adds little value to the work that has already been disseminated in these other channels; rather it is a 
redundant process, undertaken to generate, in effect, a certified archival record of the work. 
Because the audience paying attention to the field has already seen and absorbed the work in on- 
line versions, the printed publication channel grows increasingly narrow consisting primarily of 
libraries who serve as the archival institutions. Given a narrow market, costs and prices 
consequently rise on the supply side. On the demand side, libraries respond by cutting titles from 
their collections (Waters 1996). 


There is clearly little logic or economy in a process whereby scholars use printed 
publications to establish an archival record only to find that the institutions responsible for ensuring 
that the archive endures for future generations cannot afford to purchase the publications. Framed 
in this way, the problems in the scholarly communication process that appear to us as a spiral of 
escalating prices and journal cancellations are archival problems. As such, they give research 
libraries, publishers, scholars and universities substantial economic motive to save money and 
streamline the process. Where there is redundancy between print and electronic form, as there 
increasingly is in disciplines such as mathematics and physics where pre-print markets flourish, we 
need to identify and capture the real intellectual activity from the on-line places wherever it is now 
naturally occurring and ensure that such activity is housed in certified, durable and readily 
accessible archives. In so doing, we can eliminate a substantial set of redundant costs and perhaps 
even enable our colleagues in the academy to change further the ways in which they conduct 
scholarship and also, perhaps, the mechanisms, such as tenure review, by which they measure the 
quality of that work. 


We will do ourselves and our colleagues no favors, however, if we replace a costly, 
redundant archival process with one for digital materials that is even more costly. The Task Force 
(1996: 9-36) identified a wide range of factors the interaction of which provides fertile ground for 
the development of economies in archiving. The factors include the various kinds of digital 
information objects -- text, images, numeric data, sound, video, simulations, geographic 
information systems, hypermedia and so on -- and the various claims of stakeholders with interests 
in the creation, management, dissemination, use and retention of digital information. Perhaps the 
most significant factors are those affecting the integrity of information objects in whatever form 
they may appear, and those required specifically for the organization of archives. 


The integrity of information objects. The central goal of preservation must be to preserve
the integrity of the object. Knowing how to preserve a digital information object depends on being
able to define and preserve the features that give it a distinct identity and define it as a whole and
singular work. In the digital environment, the features that determine information integrity and
which deserve special attention for archival purposes include the following: content, fixity,
reference, provenance, and context. Choices about each of these features significantly affect the
economy of archiving.


[...] digital information objects range over a continuum [...] At the lowest level of
abstraction, preserving content simply means preserving a [...]al choice at this level often means
preserving the hardware and software that may be uniquely capable of interpreting the bits
associated with a particular [...] preserving the composition of ideas in a
representation.


Preserving the fixity of information objects is especially troublesome in the digital world,
where objects are frequently subject to change or withdrawal. Outside the digital arena, there are
various methods of fixing information in objects: business records contain evidence of
transactions, the acts of production and broadcast record specific radio and television programs,
and publishers generate specific versions or editions of works. In the digital environment,
however, the use of cryptography and other techniques is still maturing to support digital archives
in establishing trusted channels of distribution, and to help them discriminate among multiple
versions and to identify canonical versions. Moreover, some digital information objects are better
modeled as continuously updated databases for which the preservation choice is whether to
compile a complete record of changes or to capture snapshots of the database as the means of
preserving information integrity.
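

One way to make the notion of fixity concrete is a cryptographic digest that is recorded when an
object enters an archive and recomputed later to see whether the bits have changed. The sketch
below is illustrative only. It assumes SHA-256 as the digest, and the function names and sample
contents are hypothetical; the Task Force report does not prescribe any particular technique.

```python
import hashlib


def fixity_digest(content: bytes) -> str:
    """Return a SHA-256 hex digest used here as the object's fixity value."""
    return hashlib.sha256(content).hexdigest()


def is_unchanged(content: bytes, recorded_digest: str) -> bool:
    """True if the content still matches the digest recorded earlier."""
    return fixity_digest(content) == recorded_digest


# Hypothetical example: two versions of the same work.
version_1 = b"Annual report, first edition"
version_2 = b"Annual report, second edition"

recorded = fixity_digest(version_1)           # stored when version 1 was accessioned
same = is_unchanged(version_1, recorded)      # the original bits are intact
altered = is_unchanged(version_2, recorded)   # a different version is detected
```

A digest of this kind can show that two copies differ, but it cannot say which of them is the
canonical version; that judgment still depends on the provenance of each copy.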


Systems of citation, description, and classification provide the necessary means of 
reference for consistent discovery, identification, and retrieval of information objects over time. 
Preserving reference is thus an essential means of preserving the integrity of digital information, 
but it is problematic for several reasons. Self-referential information in digital objects seldom 
meets conventional citation quality. Moreover, consistently resolving names and locations of 
digital objects is, given the current state of the art, either difficult or unreliable. Finally, 
conventional reference mechanisms, such as on-line catalogs, do not easily accommodate certain 
kinds of reference data, such as information about the terms and conditions of licenses for 
intellectual property, which increasingly govern the use and cost of culturally-significant
information objects in the digital world.


Provenance is another essential feature of information integrity, and refers to the origin and 
chain of custody through individuals, organizations and [...], including within an
archive itself. By documenting provenance, archives create the presumption that an information
object is authentic. Compared to conventionally published objects, which employ well-known 
techniques for establishing their origin that are usually shown on a title page or its verso, the means 
of establishing the provenance of information published digitally are not yet well established. In 
addition, there are special problems in the digital world, as in other arenas, for establishing the 
provenance and authenticity of individual records, such as mail, diaries and personal databases,
and of corporate records, the understanding of which depends fundamentally on an appreciation of
their origins in policies, procedures, and organizational roles and responsibilities. Of special note 
are the integrity problems associated with digital information objects produced by digital 
instrumentation in scientific experiments, clinical services and remote sensing. Establishing 
provenance of these objects -- and thus their integrity -- requires a detailed understanding of the 
calibration, units of measure, sampling rate, recording conditions, and other features of the 
instrumentation that generated the information (see National Research Council: 1995a,b). 
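

A chain of custody of the kind described above could be recorded as a simple append-only log
attached to each object. The following is a minimal sketch under stated assumptions: the class
names, field names, and sample custodians are hypothetical and are not drawn from the Task Force
report or from any existing archival standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class CustodyEvent:
    """One step in an object's chain of custody (fields are illustrative)."""
    custodian: str  # individual or organization holding the object
    action: str     # for example "created", "transferred", "accessioned"
    date: str       # ISO 8601 date string


@dataclass
class ProvenanceRecord:
    """The origin of one object plus an append-only list of custody events."""
    object_id: str
    origin: str
    events: List[CustodyEvent] = field(default_factory=list)

    def record(self, custodian: str, action: str, date: str) -> None:
        """Append a new custody event; earlier events are never rewritten."""
        self.events.append(CustodyEvent(custodian, action, date))

    def custodians(self) -> List[str]:
        """Return the custodians in the order in which they held the object."""
        return [event.custodian for event in self.events]


# Hypothetical example of an object passing from its creator to an archive.
record = ProvenanceRecord(object_id="obj-001", origin="Example Laboratory")
record.record("Example Laboratory", "created", "1996-05-01")
record.record("Example University Archive", "accessioned", "1997-01-27")
```

For instrument-generated data, a record like this would also need fields for calibration, units
of measure, and sampling rate, which are omitted here for brevity.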


The fifth attribute of information integrity that bears on the preservation of digital
information objects is their context, the ways in which they interact with elements in the wider
digital world. Among the various dimensions of interaction, there is a technical dimension, in
which digital objects depend for their existence on specific hardware and software. There is also a
dimension of linkages to other objects. In the World Wide Web, the integrity of many objects 
resides in the network of linkages. To preserve both the objects and the linkages is a daunting 
challenge for which there exists no good solution today other than to take periodic snapshots of the 
network objects. A communications dimension of information context defines the effects of the 
medium of transmission, such as CD-ROM or networks of varying bandwidth, on the types and 
characteristics of digital information objects. Finally, a social dimension, in which government 
policies, role relationships, and other political and organizational factors shape the creation and use 
of digital objects, also affects information integrity and the ability of archives economically to 
preserve it. 


The organization of archives. Another set of factors that the Task Force on Archiving of 
Digital Information identified as grounds upon which to develop an economy of digital archiving is 
the set of factors required specifically for the organization of archives. The digital environment 
today is so fragile that those who disseminate, use, re-use, re-create, and re-disseminate various 
kinds of digital information can easily, even inadvertently, destroy valuable information, corrupt 
the cultural record, and ultimately thwart the common pursuit of knowledge. Digital archives build 
and maintain reliable collections of well-defined digital information objects and they preserve the 
features -- content, fixity, reference, provenance, and context -- that give those objects their 
integrity and enduring value. They do so by managing costs and finances within an operating 
environment that has a core set of features including the means of migrating digital information to 
maintain its vitality as hardware and software environments change. 


Among the core set of features in the operating environment of digital archives is a selection 
and appraisal process. Archives cannot save everything. To identify the most valuable objects for 
preservation, archives must appraise the content of the object -- its subject and discipline -- in 
relation to the collection goals of the digital archives, the quality and uniqueness of the object, its 
accessibility in terms of available hardware, software and legal status, its present value, and its 
likely future value. Once an object is selected for inclusion, it needs to be accessioned -- that is, 
prepared for the archives. Accessioning involves describing and cataloging selected objects, 
including their provenance to authenticate them, and securing them for storage and access. 
Storage, depending on expected use and the kind of performance needed in retrieval, may be on- 
line in magnetic media, near-line in optical or tape media in a jukebox retrieval system, or off-line 
in media that require manual intervention to retrieve. Access systems must facilitate discovery,
retrieval, and use, including the management of intellectual property rights as appropriate, in a 
distributed, presumably networked, environment. Finally, digital archives need a high level of 
systems engineering skill to manage the interlocking requirements of media, data formats, and 
hardware and software, and to help determine when objects should migrate to new systems or 
system components. 


Migration is the periodic transfer of digital materials from one hardware/software 
configuration to another, or from one generation of computer technology to a subsequent
generation. As the Task Force defined it, “the purpose of migration is to preserve the integrity of
digital objects and to retain the ability for clients to retrieve, display and otherwise use them in the 
face of constantly changing technology” (1996: 5). Digital archives have various migration 
strategies available to them. Internally, they can build hardware or software emulators to preserve 
the technical operating environments of the information objects, they can change, or “refresh,” the 
media on which the objects are stored as storage technology evolves, or they can reformat the
objects to accommodate changing technology. In addition, they can work externally with creators 
so that digital information incorporates standards that simplify the migration issues. They can 
work with systems designers to engineer cost-effective migration paths into the hardware and 
software on which information objects depend. Finally, they can use processing centers that 
develop best practices and achieve economies of scale in certain kinds of migration techniques. 
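

Of the migration strategies listed above, refreshing is the simplest to illustrate: the bits are
copied to a new medium and then checked against the original. The sketch below assumes that both
media appear as ordinary files and that a SHA-256 digest is an acceptable check; the function
names and file names are hypothetical. Emulation and reformatting are considerably more involved
and are not shown.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def refresh(source: Path, destination: Path) -> str:
    """Copy an object to a new medium and verify that the bits are identical.

    Returns the verified digest, or raises OSError if the copy differs.
    """
    before = sha256_of(source)
    shutil.copyfile(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise OSError(f"refresh of {source} failed verification")
    return after


# Hypothetical example using a temporary directory to stand in for two media.
with tempfile.TemporaryDirectory() as tmp:
    old_copy = Path(tmp) / "old_medium.dat"
    new_copy = Path(tmp) / "new_medium.dat"
    old_copy.write_bytes(b"example archived object")
    digest = refresh(old_copy, new_copy)
    refreshed_ok = new_copy.read_bytes() == old_copy.read_bytes()
```

Verifying the copy before retiring the old medium is the point of the design: a refresh that
silently alters the bits would defeat the purpose of migration as the Task Force defined it.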


Means of economizing. In this complex mix of factors by which digital archives operate to 
preserve the integrity of digital information objects, there is much room for the play of 
specialization, division of labor and competition that will not raise costs, but drive the economy of 
archiving vigorously to lower them. Division of labor and specialization are already evident. For 
example, some key services, such as rights management and network charging facilities, are 
emerging generally in the commercial marketplace and will undoubtedly serve the interests of 
archiving as well as other segments of the knowledge economy. The development of other
services, such as durable naming conventions and expanded metadata facilities, is well underway.
Still other kinds of specialized archival services -- those, for example, that require the complex 
weaving of information holdings in particular disciplines from among a variety of providers and 
custodians -- will require time and a commitment to a complex iteration and reiteration of 
exploration, development and solution as the relevant issues emerge and become clearer and more 
tractable. 


Fortunately, as we design these explorations, we have a rich experience from which it 
behooves us to draw. In the creation of information utilities like OCLC and the Research Libraries 
Group, libraries in the US came together to craft an economy out of an information management 
process in which we had formerly operated handicraft style in isolation and without the discipline 
of competition and the benefits of economies of scale. Just as we did two decades ago for 
bibliographic control, we have to find ways to invest our interactions over digital archiving with a 
marketplace dynamic that drives us to organize and routinize the activity and thereby continually to 
improve quality and lower cost. 


The process of coming to terms with each other, and with our partners in academia, in 
publishing and in the larger knowledge economy about the investments we must make in digital 
archiving is essentially a coming to terms about the centrality of archiving -- the embalming of dead 
genius -- in the pursuit of knowledge. But these understandings and agreements can be achieved 
only in actual practice. And this brings me to my third and final point: that our agreements to 
divide the labor as formal partners, as informal allies, even as competitors must set in motion soon 
and substantially the mechanics of digital archives as a pervasive and trusted foundation for cultural 
discourse. 


The Mechanics of Digital Archives 


There is an apocryphal story about the government service agency that formulated its record 
retention rules as follows: 1) discard all records when they become 30 years old; 2) retain all 
records over 50 years old, for their historical value (National Research Council, 1995b: ix). Most 
of the Task Force recommendations are designed explicitly to avoid the paralysis of this kind of 
thinking about the emerging knowledge economy. The recommendations for setting in motion the 
mechanics of digital archiving invite substantial cooperative action. They are grouped in three 
categories: pilot projects which focus on content, technologies, and the legal and economic
barriers to archiving; support structures, including national policy, legal and institutional
foundations for fail-safe mechanisms, notions of certification, scholarly and professional societies 
who need to worry about archiving the digital objects they care most about, and an international 
point of contact; and best practices for supporting archiving at the point of creation, for storage, for 
discovery and retrieval, and for migration. I draw your attention to three of the recommendations. 
They each illustrate a different form of interaction and they each yield a different kind of benefit. 


First, the Task Force called for certified digital archives. The process of certification is
meant to create an overall climate of value and of trust about the prospects of preserving digital
information. Repositories claiming to be digital archives in a changing and uncertain environment
must be able to prove that they are who they say they are, and that they can deliver on the
preservation promise. There are at least two models of certification. On the one hand, there is the 
audit model used in the US, for example, to certify official depositories of government documents. 
The depositories are subject to periodic and rigorous inspection to ensure that they are fulfilling 
their mission. On the other hand, there is a standards model which operates, for example, in the 
preservation community. Participants claim to adhere to a given set of standards; consumers 
certify by their use whether the products actually adhere to the standards. The Task Force did not 
judge the merits of these alternatives. Instead, its call for individuals and organizations to agree to 
collaborate in the design and implementation of standards, criteria, and mechanisms for 
certification, and for prospective digital archives to submit to the certification process is a summons 
for the wider community to affirm the values -- at least in the abstract -- of digital preservation and 
ultimately of the pursuit of knowledge as its own end. 


The Task Force also emphasized the need for a fail-safe mechanism in digital archives. 
Such a mechanism will enable a certified archival repository to exercise an aggressive rescue 
function to save digital information that it judges to be culturally significant and which is 
endangered in its current repository. We may not know enough about the use of digital 
information to reach consensus just yet about what fair use of it is, but we do know that one of the 
greatest dangers to its long life is the ease with which it can be abandoned or destroyed. If 
concerted action is needed in the intellectual property arena to balance the rights of creators and 
publishers against the need to support teaching and research, then let us focus at least some of that 
action on the development of the legal framework needed to support a fail-safe mechanism for 
digital archives. The benefit of such action is, of course, not in the dollars it directly generates or 
saves, but in the environment it creates for archival institutions to do their job and to realize the 
value of preserved work for future generations. 


Finally, I call attention again to the overlapping interest of those at this meeting with the 
Task Force, whose members recommended a cooperative venture to preserve the documents, 
discourse, software products and other digital information objects that serve to record the early 
digital age. Because the objects in this focal area are at such risk of loss, the project could provide 
a useful means of exploring the actual operation of archival fail-safe mechanisms. Moreover, 
conceived as a cooperative venture among multiple participating archives, the project would 
provide a necessary testbed for developing an on-line system of linked but distributed archives. 
One of the biggest unknowns in the digital environment is the full impact of distributed computing 
over electronic networks. However, as the Task Force report suggests in the section on costs and 
finances, one of the greatest hopes for reducing costs in the scholarly communication process is the 
prospect of achieving economies of scale in the storage and distribution of electronic information 
over electronic networks. We need to verify these expectations of economic benefit in actual 
experience with a range of materials in archival settings. 


Conclusion 


I conclude by observing that the notions of archives and archiving today have much 
currency and import, even outside the context in which we have been discussing them here. Such 
currency is evidence, perhaps, of the democratizing effects of the knowledge economy. In the 
New York Times Magazine at the end of 1995, William Safire devoted one of his “On Language” 
columns to the topic of kids’ slang. He advised that “if you want to stay on the generational 
offensive, when your offspring use the clichéd gimme a break, you can top that expression of 
sympathetic disbelief with jump back and the ever-popular riposte whatever.” However, he noted 
that some expressions, such as I’m outta here or I’m history, are now very much dated. I’m 
history, Safire quoted a forthcoming study of slang, is “a parting phrase modeled on an 
underworld expression referring to death” -- remember what I said about embalming -- and the
phrase has both inspired and been replaced by the more trendy expression, I’m archives (Safire
1995: 30).


With regard to the future of digital information in the pursuit of knowledge, I have no
doubt that the expression I’m archives will apply truthfully to all the institutions represented in this
conference. The choice before us, both individually and collectively, is to decide in what sense it 
will apply. 


References 


Bronson, Po 
1996 “George Gilder.” Wired 4.03 (March): 122-126, 186-195. 


Gilder, George 
1995 “Angst and Awe on the Internet.” Forbes ASAP (December 4). 
Available: [WWW http://homepage.seas.upenn.edu/~gaj1/ggindex.html].


Graham, Peter
1995 “Requirements for the Digital Research Library.” College and Research Libraries 
56(4): 331-339. 


Grauer, Neil
1995 Remember Laughter: A Life of James Thurber. Lincoln: University of Nebraska Press. 


Lanham, Richard A.
1993 The Electronic Word: Democracy, Technology and the Arts. Chicago: The University
of Chicago Press.


National Research Council
1995a Preserving Scientific Data on Our Physical Universe: A New Strategy for Archiving the
Nation’s Scientific Information Resources. Washington, D.C.: National Academy
Press.


1995b Study on the Long-Term Retention of Selected Scientific and Technical Records of the
Federal Government: Working Papers. Washington, D.C.: National Academy Press.


Norman, Donald 
1988 The Psychology of Everyday Things. New York: Basic Books, Inc. 


Pelikan, Jaroslav
1992 The Idea of the University: A Reexamination. New Haven: Yale University Press. 


Task Force on Archiving of Digital Information
1996 Preserving Digital Information. Report of the Task Force on Archiving of Digital
Information. Washington, D.C.: Commission on Preservation and Access and
Mountain View, CA: The Research Libraries Group, May 1.
Also available: [WWW:http://www-rlg.stanford.edu/ArchTF].


Safire, William 
1995 “Kiduage.” The New York Times Magazine. October 8, 1995: 28, 30. 


Waters, Donald J.
1996 Realizing Benefits from Inter-Institutional Agreements: The Implications of the Draft
Report of the Task Force on Archiving of Digital Information. Washington, D.C.: The
Commission on Preservation and Access.
Also available: [WWW:http://arl.cni.org/arl/proceedings/127/waters.html].


Archiving the Internet: 


Bold efforts to record the entire Internet 
are expected to lead to new services 


Brewster Kahle 
Internet Archive 
11/4/96 


The early manuscripts at the Library of Alexandria were burned, much of early printing was
not saved, and many early films were recycled for their silver content. While the Internet's
World Wide Web is unprecedented in spreading the popular voice of millions that would
never have been published before, no one recorded these documents and images from 1 year
ago. The history of early materials of each medium is one of loss and eventual partial
reconstruction through fragments. A group of entrepreneurs and engineers have determined
to not let this happen to the early Internet.


Even though the documents on the Internet are the easy documents to collect and archive, the
average lifetime of a document is 75 days and then it is gone. While the changing nature of
the Internet brings a freshness and vitality, it also creates problems for historians and users
alike. A visiting professor at MIT, Carl Malamud, wanted to write a book citing some
documents that were only available on the Internet’s World Wide Web system, but was
concerned that future readers would get a familiar error message “404 Document not found”
by the time the book was published. He asked if the Internet was “too unreliable” for
scholarly citation.


Where libraries serve this role for books and periodicals that are no longer sold or easily
accessible, no such equivalent yet exists for digital information. With the rise of the
importance of digital information to the running of our society and culture, accompanied by
the drop in costs for digital storage and access, these new digital libraries will soon take
shape.


The Internet Archive is such a new organization that is collecting the public materials on the
Internet to construct a digital library. The first step is to preserve the contents of this new
medium. This collection will include all publicly accessible World Wide Web pages, the
Gopher hierarchy, the Netnews bulletin board system, and downloadable software.


If the example of paper libraries is a guide, this new resource will offer insights into human
endeavor and lead to the creation of new services. Never before has this rich a cultural
artifact been so easily available for research. Where historians have scattered club
newsletters and fliers, physical diaries and letters, from past epochs, the World Wide Web
offers a substantial collection that is easy to gather, store, and sift through when compared to
its paper antecedents. Furthermore, as the Internet becomes a serious publishing system,
these archives and similar ones will also be available to serve documents that are no
longer “in print”.


Scientific American Article Draft Internet Archive


Apart from historical and scholarly research uses, these digital archives might be able to help 
with some common infrastructure complaints: 

• Internet seems unreliable: “Document not found”

• Information lacks context: “Where am I? Can I trust this information?”

• Navigation: “Where should I go next?”

When working with books, libraries help with some of these issues, with “the stacks” of
books, links to other libraries, and librarians to help patrons.


Preservation of our Digital History 


Where we can read the 400 year-old books printed by Gutenberg, it is often difficult to read
a 15 year-old computer disk. The Commission on Preservation and Access in Washington
DC has been researching the thorny problems faced in trying to ensure the usability of
digital data over a period of decades. Where the Internet Archive will move the data to new
media and new operating systems every 10 years, this only addresses part of the problem of
preservation.


Using the saved files in the future may require conversion to new file formats. Text,
images, audio, and video are undergoing changes at different rates. Since the World Wide
Web currently has most of its textual and image content in only a few formats, we hope that
it will be worth translating in the future, whereas we expect that the short lived or seldom
used formats will not be worth the future investment. Saving the software to read discarded
formats often poses problems of preserving or simulating the machines that they ran on.


The physical security of the data must also be considered. Natural and political forces can
destroy the data collected. Political ideologies change over time, making what was once
legal become illegal. We are looking for partners in other geographic and national locations
to provide a robust archive system over time. To give some level of security from
commercial forces that might want exclusive access to this archive, the data is donated to a
special non-profit trust for long-term care taking. This non-profit organization is endowed
with enough money to perform the necessary maintenance on the storage media over the
years.


Packaging enough meta-data (information about the information) is necessary to inform
future users. Since we do not know what future researchers will be interested in, we are
documenting the methods of collection and attempting to be complete in those collections. As
researchers start to use these data, the methods and data recorded can be refined.


Technical Issues of Gathering Data 


Building the Internet Archive involves gathering, storing, and serving the terabytes of 
information that at some point were publicly accessible on the Internet. 


Gathering these distributed files requires computers to constantly probe the servers looking 
for new or updated files. The Internet has several different subsystems to make information 
available such as the World Wide Web (WWW), File Transfer Protocol (FTP), Gopher, and 
Netnews. New systems for three-dimensional environments, chat facilities, and distributed 
software require new efforts to gather these files. Each of these systems requires special 
programs to probe and download appropriate files. Estimating the current size, turnover, 
and growth of the public Internet has proven tricky because of the dynamic nature of the
systems being probed. 


Protocol   Number of Sites       Total Data   Change rate

WWW        400,000               1,500GB      600GB/month
Gopher     5,000                 100GB        declining (from Veronica Index)
FTP        10,000                5,000GB      not known
Netnews    20,000 discussions    240GB        16GB/month


The World Wide Web is vast, growing rapidly, and filled with transient information.
Estimated at 50 million pages with the average page online for only 75 days, the turnover is
considerable. Furthermore, the number of pages is reported to be doubling every year.
Using the average web page size of 30 kilobytes (including graphics) brings the current size
of the Web to 1.5 terabytes (or 1.5 million megabytes).
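

As a rough check on these figures, the arithmetic can be sketched in Python. The page count,
average page size, and average lifetime come from the text; the function names and the
steady-state assumption (every page is replaced once per average lifetime) are illustrative
assumptions, not a description of the Archive's own method.

```python
# Figures quoted in the text. Units are decimal, as the article defines them
# (1 GB = 1000 MB, 1 TB = 1000 GB).
PAGES = 50_000_000          # estimated number of Web pages
AVG_PAGE_KB = 30            # average page size, including graphics
AVG_LIFETIME_DAYS = 75      # average time a page stays online


def web_size_terabytes(pages, avg_kb):
    """Total size of the Web in terabytes (KB -> TB is a factor of 10**9)."""
    return pages * avg_kb / 1_000_000_000


def monthly_turnover_gb(pages, avg_kb, lifetime_days, days_per_month=30):
    """Data replaced per month, ASSUMING each page is replaced once per lifetime."""
    total_gb = pages * avg_kb / 1_000_000
    return total_gb * days_per_month / lifetime_days


size_tb = web_size_terabytes(PAGES, AVG_PAGE_KB)
turnover_gb = monthly_turnover_gb(PAGES, AVG_PAGE_KB, AVG_LIFETIME_DAYS)
print(f"Web size: {size_tb} TB, turnover: {turnover_gb} GB/month")
```

Under these assumptions the estimate comes to 1.5 terabytes in total and about 600 GB of
turnover per month, which agrees with the WWW row of the table above.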


To gather the World Wide Web requires computers specifically programmed to “crawl” the
net by downloading a web page, then finding the links to graphics and other pages on it, and
then downloading those and continuing the process. This is the technique that the search
engines, such as Altavista, use to create their indices to the World Wide Web. The Internet
Archive currently holds 600GB of information of all types. In 1997 we will have collected a
snapshot of the documents and images.
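

The crawling loop described above can be sketched as follows. This is a minimal illustration,
not the Archive's actual crawler: the in-memory TOY_WEB dictionary stands in for the network,
and the names crawl, LinkExtractor, and the example.org URLs are hypothetical. A real crawler
would also need politeness delays, error handling, and separate treatment of binary files.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the targets of <a href=...> and <img src=...> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and ((tag == "a" and name == "href") or (tag == "img" and name == "src")):
                self.links.append(value)


def crawl(start_url, fetch, max_pages=1000):
    """Breadth-first crawl: download a page, find its links, download those, repeat.

    `fetch` is any callable returning the page text, or None if the page is gone.
    """
    seen = {start_url}
    queue = deque([start_url])
    archive = {}
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        body = fetch(url)
        if body is None:          # e.g. "404 Document not found"
            continue
        archive[url] = body
        parser = LinkExtractor()
        parser.feed(body)
        for link in parser.links:
            absolute = urljoin(url, link)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return archive


# Hypothetical three-document "web" used instead of real network access.
TOY_WEB = {
    "http://example.org/": '<a href="/a.html">A</a> <img src="logo.gif">',
    "http://example.org/a.html": '<a href="/">home</a> <a href="/missing.html">gone</a>',
    "http://example.org/logo.gif": "",
}

snapshot = crawl("http://example.org/", TOY_WEB.get)
print(sorted(snapshot))
```

The `seen` set keeps the crawler from downloading the same page twice when pages link back to
each other, and the missing page is simply skipped rather than archived.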


The information collected by these “crawlers” is not, unfortunately, all the information that
can be seen on the Internet. Much of the data is restricted by the publisher, or stored in
databases that are accessible through the World Wide Web but are not available to the
simple crawlers. Other documents might have been inappropriate to collect in the first
place, so authors can mark files or sites to indicate that crawlers are not welcome. Thus the
collected Web will be able to give a feel of what the web looked like at a particular time, but
will not simulate the full online environment.
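

The text does not name the mechanism authors use to turn crawlers away. One widely used
convention is a robots.txt file at the root of a site, and the sketch below assumes that
convention, using Python's standard urllib.robotparser module. The robots.txt contents, the
"toy-archiver" user-agent name, and the URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical exclusion file: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the exclusion rules permit this crawler to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for url in ("http://example.org/index.html", "http://example.org/private/diary.html"):
    verdict = "collect" if allowed(ROBOTS_TXT, "toy-archiver", url) else "skip"
    print(url, "->", verdict)
```

A crawler that honors such rules checks each URL before downloading it, so excluded material
never enters the collection in the first place.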


While the current sizes are large, the Internet is continuing to grow rapidly. When it is
common to connect one’s home camcorder to the upcoming high bandwidth Internet, it will
not be practical to archive it all. At some point we will have to become more selective about
what data will be of the most value in the future, but currently we can afford to gather it all.


Storing Terabytes of Data Cost Effectively 


Crucial to archiving the Internet, and digital libraries in general, is the cost effective storage
of terabytes of data while still allowing timely access. Since the cost of storage has been
dropping rapidly, the archiving cost is dropping. The flip side, of course, is that people are
making more information available.


To stay ahead of this onslaught of text, images, and soon video information we believe we
have to store the information for much less money than the original producers paid for their
storage. It would be impractical to spend as much on our storage as everyone else combined.


Storage Technologies    Cost per GigaByte    Random access time
Memory (RAM)            $12,000/GB           70 nanoseconds
Hard Disk               $200/GB              15 milliseconds
Optical Disk Jukebox    $140/GB              10 seconds
Tape Jukebox            $20/GB               4 minutes
Tapes on shelf          $2/GB                human assistance required


(1 GigaByte = 1000 MegaBytes, 1 TeraByte = 1000 GigaBytes. A GigaByte is roughly
enough to store 1000 books or 1 hour of compressed video)


With these prices, we chose hard disk storage for a small amount of the frequently accessed
data combined with tape jukeboxes. In most applications we expect a small amount of
information to be accessed much more frequently than the rest, leveraging the use of the
faster disk technology rather than the tape jukebox.
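

To make the trade-off concrete, the sketch below compares an all-disk store with a mix of disk
and tape jukebox, using the per-gigabyte prices from the table. The 10 percent "hot" fraction
and the rule that hot data lives only on disk are assumptions chosen for illustration; the text
does not say what split was actually used.

```python
# Cost per gigabyte in dollars, taken from the table in the text.
COST_PER_GB = {
    "ram": 12_000,
    "hard_disk": 200,
    "optical_jukebox": 140,
    "tape_jukebox": 20,
    "tape_shelf": 2,
}


def tiered_cost(total_gb, hot_fraction, hot="hard_disk", cold="tape_jukebox"):
    """Dollar cost of keeping `hot_fraction` of the data on the fast tier
    and the remainder on the slow tier (assumed split, see lead-in)."""
    hot_gb = total_gb * hot_fraction
    cold_gb = total_gb - hot_gb
    return hot_gb * COST_PER_GB[hot] + cold_gb * COST_PER_GB[cold]


WEB_GB = 1_500  # the 1.5 terabyte Web estimate quoted earlier

all_disk = tiered_cost(WEB_GB, 1.0)
mixed = tiered_cost(WEB_GB, 0.10)   # hypothetical: 10% of the data is "hot"
print(f"all disk: ${all_disk:,.0f}   disk + tape jukebox: ${mixed:,.0f}")
```

With these assumed numbers the mixed arrangement costs roughly a fifth of the all-disk one,
which is the kind of saving that makes the slower four-minute tape access acceptable for
rarely used material.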


Providing Access and New Services 


After gathering and storing the public contents of the Internet, what services would then be
of greatest value with such a repository? While it is impossible to be certain, digital versions
of paper services might prove useful.


For instance, we can provide a “reliability service” for documents that are no longer
available from the original publisher. This is similar to one of the roles of a library. In this
way, one document can refer, through a hypertext link, to a document on another server and
a reader will be able to follow that link even if the original is gone. We see this as an
important piece of infrastructure if the global hypertext system is to become a medium for
scholarly publishing.
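

One way such a reliability service could behave is sketched below: try the original publisher
first, and fall back to the most recent archived copy when the original is gone. The function
name resolve, the shape of the archive (URL mapped to dated snapshots), and the sample data are
all hypothetical; the text does not describe an actual interface.

```python
def resolve(url, live_fetch, archive):
    """Return (body, source) for a hypertext link.

    `live_fetch(url)` returns the current page text or None if it is gone.
    `archive` maps each URL to a dict of {ISO date string: page text}.
    """
    body = live_fetch(url)
    if body is not None:
        return body, "live"
    snapshots = archive.get(url)
    if snapshots:
        latest = max(snapshots)   # ISO dates sort correctly as plain strings
        return snapshots[latest], f"archived {latest}"
    return None, "not found"


# Hypothetical data: one page is still online, one survives only in the archive.
LIVE = {"http://example.org/now.html": "current text"}
ARCHIVE = {
    "http://example.org/gone.html": {
        "1996-11-04": "first snapshot",
        "1997-01-27": "later snapshot",
    }
}

print(resolve("http://example.org/now.html", LIVE.get, ARCHIVE))
print(resolve("http://example.org/gone.html", LIVE.get, ARCHIVE))
```

Returning the source alongside the text lets a reader see whether a citation was satisfied by
the original publisher or by an archived copy, and from what date.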


Another application for a central archive would be to store an “official copy of record” of
public information. These records are often of legal interest, helping to determine what was
said or known at a particular time.


Historians have already found the material useful. David Allison of the Smithsonian
Institution has used the materials for an exhibit on Presidential Election websites, which he
thinks might be the equivalent of saving videotapes of early TV campaign advertisements.
David Eddy Spicer of Harvard’s Kennedy School of Government has used the materials for
their “case studies” in much the same way they collect old newspaper articles to capture a
point in time.


With copies of the Internet over time and cross correlation of data from multiple sources,
new services might help users understand what they are reading, when it was created, and 
what other people thought of it. With these services, people might be able to give a context 
to the information they are seeing and therefore know if they can trust it. Furthermore, the 
coordination of this meta-information and usage data can help build services for navigating 
the sea of data that is available. 


Companies are also interested in saving similar information and building similar services 
based on their internal information to help employees effectively learn from the experiences
of others. 


The technologies and the services that will grow out of building digital archives and digital 
libraries could lead towards building a reliable system of information interchange based on 
electrons rather than paper. Using the “library” might be done many times a day to use 
documents that are no longer available on the Internet. 


Legal and Social Issues 


Creating an archive of informal and personal information raises many difficult legal and social
issues even if the material was intended to be publicly accessible at some point. Such a
collection treads into the murky area of intellectual property in the digital era. What can be
done with the digital works that are collected gets into the areas of copyright, privacy,
import/export restrictions, and possession of stolen property.


To give a few examples: what if a college student made a web page that had pictures of her
then-current boyfriend, but later wanted to take it down and “tear it up”, yet it lived on in
digital archives (whether accessible or not)? Should she have the right to remove that
document? Should a candidate for political office be able to go back 15 years to erase his
postings to public bulletin boards that have been saved in the Archive? What if a software
program that is legal to publish in Denmark, but illegal in the United States, is collected by
an archive: should this program be removed and hidden even from historians and scholars?
The legal and social issues raised by the construction of the Archive are not easily resolved.


By allowing authors to exclude their information from the Archive we hope to avoid some of
the immediate issues, and allow enough time to pass to understand the larger issues at hand.


The Internet Archive might be able to help resolve some of these issues by publicly drawing
the issues out and by participating in the debates. While many of these questions will take
years to resolve, we feel it is important to proceed with the collection of the material since it
can never be recovered in the future.


Where does it go from here? 


The new technologies and services currently being created might be useful in all digital 
libraries and help make the Internet more robust and useful. 


Through an archive of what millions of people are interested in making public, we might be
able to detect new trends and patterns. Since these materials are in computer readable form,
searching them, analyzing them, and distributing them has never been easier. A variety of
services built on top of large data sets will allow us to connect people and ideas in new ways.


For instance, Firefly Inc. is using the individual tastes in music and movies to help suggest
other CD's and videos based on finding “similar” people. They have even found that people
are interested in communicating with the other “similar” people directly, thus forming
communities based on similar interests. This kind of computer matchmaking, which is based
on detailed portraits of people's preferences, suggests similar services based on reading
habits.


Trends in academic fields might be able to be detected more easily by studying gross 
statistics of the communications in the field. The hypertext links of the World Wide Web 
form an informal citation system similar to the footnote system already in use. Studying the 
topography of these links and their evolution might provide insights into what any given 
community thought was important.


If archiving cultural and personal histories becomes useful commercially, then the efforts can
be expanded to record radio and video broadcasts. These systems might allow us to study
these effects and influences on our lives.


Current terabyte technologies (storage hardware and management software) are relatively
rare and specialized because of their costs, but as the costs drop we might see new
applications that have traditionally used non-computer media. For instance,


• A video store holds about 5,000 video titles, or about 7 terabytes of compressed
data.
• A music radio station holds about 10,000 LP’s and CD’s, or about 5 terabytes of
uncompressed data.
• The Library of Congress contains about 20 million volumes, or about 20
terabytes of text if typed into a computer.
• A semester of classroom lectures of a small college is about 18 terabytes of
compressed data.

Therefore the continued reduction in price of data storage, and also data transmission, could
lead to interesting applications as all the text of a library, music of a radio station, and video
of a video store become cost effective to store and later transmit in digital form.
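

These estimates can be cross-checked against the rule of thumb given with the storage table
(a gigabyte holds roughly 1000 books or 1 hour of compressed video). The sketch below works
out the per-item size implied by two of the figures; the function name is hypothetical and only
numbers quoted in the text are used.

```python
MB_PER_TB = 1_000_000  # decimal units, as defined with the storage table


def implied_mb_per_item(items, terabytes):
    """Average size in megabytes per item implied by a collection estimate."""
    return terabytes * MB_PER_TB / items


video_mb = implied_mb_per_item(5_000, 7)          # video store: 5,000 titles, 7 TB
volume_mb = implied_mb_per_item(20_000_000, 20)   # Library of Congress: 20M volumes, 20 TB

print(f"per video title: {video_mb} MB (about {video_mb / 1000} hours at 1 GB per hour)")
print(f"per volume: {volume_mb} MB (rule of thumb: 1 GB / 1000 books = 1 MB)")
```

The Library of Congress figure works out to 1 megabyte per volume, matching the rule of thumb
exactly, and the video store figure implies about 1.4 gigabytes per title, or a little under an
hour and a half of compressed video each.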


In the end, our goal is to help people answer hard questions. Not “what is my bank
balance?”, or “where can I buy the cheapest shoes?”, or “where is my friend Bill?” - these
will be answered by smaller commercial services. Rather, we want to help answer the hard
questions like: “Should I go back to graduate school?” or “How should I raise my children?”
or “What book should I read next?”. Questions such as these can be informed by the
experiences of others. Can machines and digital libraries really help in answering such
questions? In the long term, we believe yes, but perhaps in new ways which would have
importance in education and day-to-day life.


Further Reading: 


Preserving Digital Objects: Recurrent Needs and Challenges, December 1995 presentation
at 2nd NPO conference on Multimedia Preservation, Brisbane, Australia.
http://community.bellcore.com/lesk/auspres/aus.html


The Vanished Library, Luciano Canfora. University of California Press, 1990.


Biography: 


Brewster Kahle is a founder of the Internet Archive, established in April 1996. Before that, he
invented the Wide Area Information Servers (WAIS) system in 1989 and founded WAIS
Inc in 1992. WAIS helped bring commercial and government agencies onto the Internet by
selling Internet publishing tools and production services to companies such as Encyclopaedia
Britannica, the New York Times, and the Government Printing Office.


Schooled at MIT (BSEE ’82), Brewster designed supercomputers in the 80's at Thinking
Machines Corporation.


Internet Archives and Copyright 
Professor I. Trotter Hardy¹


Copyright Law Overview 


Copyright law can be thought of in three parts: What kinds of “things” can be copyrighted?
How does one get a copyright in those things? And what rights does a copyright owner
have?


Note that just because something is the “kind of thing” that can have a copyright does not 
necessarily mean that it does have a copyright. For example, a “Shakespeare play” is the
kind of thing that can be copyrighted, but it has long since passed into the public domain (in 
fact, it was written at a time in English history when there was no such thing as “copyright” in 
any event). 


What kinds of things can be copyrighted? 


Copyright applies to “original works of authorship.” That phrase has historically been given a
very broad reading, so that quite a lot of things are at least potentially subject to copyright
protection. The Copyright Act lists examples of types of “works” that qualify: literary works;
musical compositions; dramatic works; pantomimes and choreographic works; pictorial,
sculptural, and graphic works; audio-visual works; sound recordings; and architectural
works. Even these fairly extensive examples do not convey the full extent of copyright's
scope. For example,


(1) “Literary works” includes not only what would ordinarily be thought of as “literature”
(novels, poems, stories, essays, etc.) but also any “works expressed in words, numbers,
or other verbal or numerical symbols or indicia,” which includes computer programs and
occasionally even things like lists of company inventory numbers.

(2) “Musical compositions” includes both lyrics and melody or tune.

(3) “Dramatic works” includes things like plays, screenplays, and TV scripts.

(4) “Pantomimes and choreographic works” includes ballets, dances, or mime skits.

(5) “Pictorial, sculptural, and graphic works” includes not only paintings, drawings, and
prints, but also sculpture, maps, diagrams, blueprints, and other technical drawings.

(6) “Audio visual works” includes movies, film strips, slide presentations, and presumably
multi-media CD-ROM packages.


¹ © 1997 I. Trotter Hardy. Professor Hardy is Professor of Law at the William and Mary School of Law, Williamsburg,
Virginia 23187. E-mail address: thardy@facstaff.wm.edu.


HARDY 


COPYRIGHT AND ARCHIVES PAGE 2 


(7) “Sound recordings” includes the actual sounds in a recording—that is, the particular
tempo, arrangement, dynamics, and so on, as distinct from the musical composition. A
“musical composition” can exist solely as sheet music; but a “sound recording” has to be
an actual record of sound: vinyl record, audio-cassette tape, audio CD, etc.


(8) “Architectural works” includes the design of a building, as represented by the building
itself, or by the plans. The plans would also be copyrightable as a “pictorial, sculptural,
and graphic work” (in this case, a pictorial or graphic work).


What cannot be copyrighted? 


The list of things that are potentially subject to copyright's protection is so extensive that 
people often wonder what isn't copyrighted. Things that are not copyrighted include: 


• Ideas and facts

• Works whose copyright term has expired

• Works of the U.S. Government

• Laws (statutes, cases, regulations, constitutions)

• Things authors have dedicated to the public domain


The Copyright Act specifies that 


“In no case does copyright protection extend to any idea, procedure, process,
system, method of operation, concept, principle, or discovery.” 


In other words, if you have a great idea and write it up in an article, others are free to talk 
about and write about your idea—they just can't copy any substantial part of the words in 
your article to do so. The same thing applies to “facts,” which would fall in the area of what 
the law calls “discoveries.” If you as a historian learn amazing things about some historical 
event, things that no one else knows, you may write an article about the event, and you may 
stop others from reproducing your article's words, but you may not stop others from sharing, 
talking about, writing about, etc. the new facts that you discovered. 


The list of potentially copyrightable things is limited to potentially copyrightable things. A 
novel may be copyrighted, for example, but if the term of copyright has expired, then the 
novel “falls into the public domain” and thereby loses all copyright protection. Works in the
public domain may be freely copied in whole or in part. 


The exact term of a copyrighted work depends on when it was written or published. In 
general, before 1978 (when the law was changed) copyright lasted for 28 years, and could 
optionally be renewed for a second term of 28 years—a possible 56 years total. After 1978, 
the term is no longer a fixed number of years but lasts for the lifetime of the author plus 
another 50 years. Proposals are afloat to lengthen that term to “life plus 70 years.”


One reason for the change from a fixed term of years, to a variable term that depends on the 
author's life span, is that all of an author's works will go into the public domain at one time— 
50 years after the author's death. 


Also, when a book or other copyrighted work has been published for at least 75 years, you
can write to the Copyright Office and ask whether the Office has any record that the author is
living or dead. If the Office has no such records, you may conclusively presume that the
author has been dead for at least 50 years and you cannot be sued even if this presumption
later turns out to be wrong.


Another big exception to copyright is “works of the U.S. government.” U.S. Government
reports, memos, documents, rules, agency publications, and the like are not copyrighted
and may be freely copied and used by anyone. If you want to repackage and sell IRS
bulletins on tax law or Agriculture Department booklets on milking cows, you may (others
have).


State governments are not prevented from copyrighting their materials by the Copyright Act.
But court decisions have long since established that “laws” (including statutes, court
opinions, and regulations) are in the public domain. And as a practical matter, most states
do not attempt to assert copyright over anything except perhaps computer software or other
high-cost items.


How do you get a copyright? 


You don't “get” a copyright: this is a common misconception. You either “have” a copyright
or you don't. If you create something original that is copyrightable, and you “fix” it in some
medium of expression (write it down; sing it into a tape recorder; video tape it; etc.) then
whatever you created is copyrighted as soon as it is “fixed.” If you are writing the great
American novel on your word processor, then the novel is being copyrighted as you enter it
into your computer. Computer entry, even into the computer's temporary memory, is usually
considered “fixed” enough for copyright purposes, and most definitely is “fixed” when it is
saved to a hard disk. If you like to write with a pencil, whatever you write is being copyrighted
as it is being written—writing with a pencil is a kind of “fixation.”


Of course, if you are copying something created by someone else, then the “something” you
are copying is not original to you. Hence you would not be creating a copyrighted work in that
circumstance.


So what do people contact the Copyright Office for? 


You may, if you like, register a copyright with the Copyright Office. That requires a two-page
form and (currently) a $20 fee. But registering a copyright is not the same thing as having a
copyright. Remember you have a copyright whenever you fix an original work of
authorship—write it down, record it, etc. Registration is handy for a few reasons, though far
from essential. If you think you might ever sue anyone for copyright infringement, registering
early offers some advantages in the amount of damages you might be able to collect.


What is the meaning of the little phrase: “© 1997 M. Smith”? 


The expression “© 1997 M. Smith” or its equivalent “Copyright 1997 M. Smith” or “Copr.
1997 M. Smith,” is called the “copyright notice.” At one time, you were required to put such
a notice on any work you wanted to protect whenever you published the work. Others who 
saw a published work without such a notice were entitled to treat the work as being in the 
public domain. 


The original thinking behind the requirement was to put others “on notice” that you intended 
to preserve your copyright rights. In other words, our law carried a presumption that a 
published work was not copyrighted unless it carried the copyright notice. 


Since 1989, however, our laws no longer require a copyright notice. At that time we reversed 
the presumption: the current presumption is that any work, published or not, is copyrighted 
by its author, unless the author indicates the contrary. Sometimes people do indicate the 
contrary by noting “This article may be freely copied,” or “This article may be copied and
distributed for educational purposes, as long as attribution is given to the author and a fee no 
higher than the cost of copying is charged,” or “This program is shareware: it may be copied 
without limitation, but if you use it for more than 30 days you are required to pay the author 
$25," etc. 


But in the absence of any such indication, one must presume that if a work is the kind of
thing that can be copyrighted, it is copyrighted.


Among other interesting consequences of this rule is the fact that every single (non-trivial)
piece of e-mail sent over the Internet is copyrighted unless it says otherwise. Ditto with Web
pages, Usenet newsgroup postings, and the like. 


What Rights Does a Copyright Owner Have? 


Copyright owners have five basic rights and a sixth extraordinarily complicated right:

1. to reproduce the work;

2. to prepare derivative works based on the work;

3. to distribute copies of the work to the public;

4. to perform the work publicly (if it is the kind of thing that can be “performed,” like a song
or a play or a movie, and if it is not a “sound recording,” a complicated subject discussed
below);

5. to display the work publicly; and

6. if it is a “sound recording,” to perform the work publicly by means of a “digital audio
transmission.”


“Reproduce” means “copy.” The term applies to almost any instance of copying, including 
the making of a single copy of an article from a magazine by photocopying it, or downloading
pages from a Web site, or scanning an image into digital form. Some of these activities do
not necessarily infringe the copyright owner's rights because they might be considered a “fair
use” of the work (discussed below). But they all constitute “reproduction” and hence at least 
raise the question of whether they infringe any copyrights. 


“Derivative works” includes anything that is “based upon one or more preexisting works,
such as a translation, musical arrangement, dramatization, fictionalization, motion picture
version, ... or any other form in which a work may be recast, transformed, or adapted.”
When a movie company wants to make a movie version of a book, the company must get
the book author's permission because otherwise the company would violate the author's
right to make derivative works. Similarly, anyone seeking to translate a book into another
language must also seek the author's permission or else infringe the author's right to make
derivative works.


Authors have the right to say when and how their works will be “distributed to the public.”
Obviously, this right was written into the law when the principal means of making money was
to produce multiple copies of a work and “distribute” them (i.e., either sell, or rent, or lend,
etc.) to the public. This continues to make sense when we think about books, CDs, video
cassettes, and other tangible embodiments of copyrighted works.


It makes less sense when we think about information that is “posted” to a Web site. This sort
of thing obviously wasn't contemplated when the current Copyright Act was adopted in
1976. Technically, the person doing the posting is not really “distributing” anything because
they are just making a copy onto a Web-accessible machine. Moreover, the poster is not
really making “copies” at all; if anything, it would be those who visit the site who make the
“copies.” So that leaves a bit of a question as to whether “posting” to a Web site constitutes
“public distribution of copies.”


The few indications we have so far are that courts will very likely interpret the Copyright Act's 
language so that “posting” would indeed constitute “public distribution.” If so, that would 
mean that posting another's copyrighted work on a Web site without permission would be 
an infringement of the copyright owner's right of “public distribution.” 


Posting on dorm room walls, on the Web 


Among other things, this is counter-intuitive to many people. It is fairly common today for,
say, a student to cut out a poster or picture of a movie star or Star Trek episode or similar
thing and post it on a college dorm wall. Understandably, when students also have a “home
page” on the Web, they think of doing the same thing there: getting a digital copy of a poster
or picture and posting it on their Web site. They often want to know “What's the difference? If
it's posted on my dorm wall, or posted on my Web site, who cares?”


Again, we do not have clear answers as to how courts would handle this situation, but my
own instincts are that there are significant differences. Posting a physical copy of a picture on 
a dorm-room wall would be considered a “display” of the picture, but it would not be a public 
display. The copyright owner can only object to public displays, so there would be no 
infringement in the dorm-room case. Posting something to a Web site, however, as noted 
above, is likely to be seen as a “distribution” of the work, or perhaps a “display” of the picture.
Either way, it would almost certainly be considered a public distribution or display, and hence 
it would be an infringement of the copyright owner's rights. 


Public performances of music and sound recordings 


Perhaps the most counter-intuitive right of copyright owners is to object to the “public
performance” of their works. “Perform” originally meant what most people think it means:
putting on a play, for example, or singing a song. But over the years it has acquired a more
technical meaning. It now means something like “making use of a work that unfolds through
time." 

For example, playing a CD on a CD machine is a “performance” of the works on the CD. 
Playing a video tape on a VCR is a performance of the movie (as is showing the movie in a 
movie theater, of course). Even if the “performing” is, as it is in these cases, only being done 
by a machine, it is still defined as a “performance” for copyright purposes. BUT: remember 
that the only right a copyright owner has is to object to public performances. That is why
playing a CD or video tape at home does not violate the copyright owner's rights: playing
these things at home is a “performance” for copyright purposes, but it is a private, not a
public, performance, and so it is not an infringement of anybody's rights.


Sound recordings: not the same as “musical compositions” 


Another thoroughly non-intuitive feature of authors’ rights is that although authors have a
general right to object to public performances of their works, they do not have that right with
respect to “sound recordings” specifically.


The typical “record” of music, whether on LP, tape, CD, or whatever, actually carries two
copyrights. One is on the musical composition (the song, plus melody) and a second is on
the way the particular sounds are preserved on the record: the arrangement, tempo,
dynamics, harmonics, tonal variations, sound “color,” and so on. If you look at the label on
CDs and tapes, you will often see two copyright “notices,” one with a “c” in a circle, like “©,”
and the other with a “p” in a circle (for which my word processor does not have the matching
symbol). The “c” notice is for the musical composition (song, tune, etc.). The “p” notice is for
the sound recording (and stands for “phono-record”).


Let's take an example of how that works out in practice. Suppose that “discotheques” or
“discos” still exist (I have no idea whether they do or not). At a given disco, the proprietor
plays a CD of 2-Live Crew singing “Margaritaville” by composer Jimmy Buffett. The disco is
open to the public.


Are any rights infringed? Yes: Buffett is the owner of the copyright on the musical
composition “Margaritaville,” and Buffett has the right to object if his song is publicly
performed without his permission. (In practice, he would not “object” but rather seek
payment of royalties.) A disco open to the public is a public place; the playing of a CD is a
“performance;” and the playing of a CD at a place open to the public is defined in the law as
a “public performance.” Buffett can complain if he is not being compensated.


But there is a second copyright on this CD: it is on the way 2-Live Crew have created and
recorded their particular performance. This is the “sound recording” copyright, as distinct
from the “composition” copyright. Let us assume that 2-Live Crew is the owner of the
copyright in the particular way they perform the song on the CD. They would own a separate
copyright in the “sound recording” embodied by the CD. Can they also object to the disco's
publicly performing their sound recording? No, they cannot: owners of sound recording
copyrights do not have the right to object to public performances. So the disco owner owes
nothing to 2-Live Crew.


If all this sounds inconsistent, it is: it is nonetheless the way the law is written.


Sound recordings and digital transmissions


The copyright owner's right to object to the public performance of a “sound recording” by
means of a “digital audio transmission” is a little different, however.


What was intended in this provision was that digital transmissions, which permit exact
reproduction, should be disallowed if the transmitter is making money from the particular
transmission. The way that general concern is reflected in the Copyright Act is that owners of
copyright in sound recordings may object to the digital transmission, through an
interactive service, for which a subscription payment is made. Each of these elements is
more or less defined in the Act, but the definitions and the provision overall are extraordinarily
complicated.


Quite clearly, one thing not allowed (at least not without the copyright owner's permission) 
would be an Internet service that, for a monthly fee, allowed customers to choose the songs
they wanted, and have those songs “played” over the Internet in digital format. 
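

The two-copyright rule described in this section can be sketched as a small function. The
Python below is only an illustration of the summary given here, not of the statute: the
function name, parameter names, and owner labels are all invented for the example, and the
three tests for a digital transmission are the simplified elements named above (a digital
transmission, an interactive service, a subscription payment).

```python
def who_may_object(public, digital_transmission=False,
                   interactive_service=False, subscription_paid=False):
    """Return the set of copyright owners who could object to a performance.

    Hypothetical sketch of the rule as summarized in the text; it assumes
    no permission has been given and ignores fair use entirely.
    """
    owners = set()
    if not public:
        # A private performance (a CD played at home) infringes no one's rights.
        return owners
    # The owner of the musical composition may object to any public performance.
    owners.add("composition")
    # The owner of the sound recording may object only to a digital audio
    # transmission made through an interactive service for a subscription fee.
    if digital_transmission and interactive_service and subscription_paid:
        owners.add("sound recording")
    return owners


# The disco playing a CD: only the composer has a claim.
print(who_may_object(public=True))

# A pay-per-month Internet service that plays chosen songs on request.
print(who_may_object(public=True, digital_transmission=True,
                     interactive_service=True, subscription_paid=True))
```

The sketch deliberately returns a set of possible claimants rather than a yes-or-no answer,
because the point of the disco example is that the two owners are treated differently.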


What is less clear is whether an archive that charged a monthly or annual fee for 
researchers, and that included digital versions of music, would be allowed to provide its 
services without permission. Owners of copyright in the musical compositions can
certainly object to (or charge for) this practice. The question is whether the owner of copyright 
in the associated sound recordings could object or not. The provision in the law about 
digital transmissions is so new that we just do not know how it is going to be interpreted. 


A big exception to the rights of authors is the concept of “fair use.” Essentially this concept
says that some uses of copyrighted works are deemed by the law to be either of so little
harm to the copyright owner, or of so great a benefit to others, that the use is “fair” and the
user cannot be sued for infringement.


Note that “fair use” does not come into the picture unless a work is otherwise copyrighted,
and the use would otherwise be a violation of the owner's rights. That means that making
copies of a work that is in the public domain is perfectly legitimate, not because it is a “fair
use,” but rather because the work is not copyrighted to start with, so we don't need to reach
the question of fair use. Similarly, if one plays a CD at home, we don't need to reach the 
question of fair use because a private performance is not a violation of the owner's rights in 
the first place. 


Fair use is designed to be “context-sensitive.” That is, it turns on the circumstances of each
particular case. That means that it is both flexible (that's good), and also hard to predict (not
so good). Courts look at four “factors” in any fair use case and try to draw an overall
assessment of the situation. The factors are:


1. Purpose and character of the use (To make money? To further scholarship?)


2. Nature of the copyrighted work (Highly creative fiction? A list of company inventory 
numbers?) 


3. Amount used (Did you copy or perform the whole thing? Just a tiny excerpt?) 


4. Effect on the market (Will the use, if it were to become wide-spread and no royalties
were paid, deprive the owner of significant revenue?)


Many people think that “educational” purposes are automatically a fair use, but there is
nothing automatic about it. Non-profit, educational uses are certainly in the “favored”
category under “purpose and use,” but one must still look at the other factors. For example,
copying an entire textbook for a class to save the money of buying textbooks would almost
certainly not be a fair use, largely because the “amount used” would be the entire work, and
the negative effect on the market (here, the textbook market) would be very substantial.


On the other hand, copying a newspaper article from yesterday's paper to hand out to 
today's class in political science would almost certainly be a fair use. Although the amount 
taken is “all of the article,” the purpose is educational; the nature of the work is fairly factual
(the more factual a work, the less it is protected); the effect of deeming such copying to be a
fair use on sales of the newspaper is negligible; and the effect of deeming it a fair use on the
market for licensing spontaneous classroom copying is also negligible.


What about archives generally? 


Now we are in a position to talk about archives and copyright. What might be the issues? 
Well, to start with, “archives” are explicitly mentioned several times in the Copyright Act.
Section 108, for example, is a fairly lengthy provision dealing specifically with “libraries and 
archives.” The provision was adopted primarily for what we think of as libraries and archives 
in the physical sense: a public library of books and magazines, a university library of books 
and scholarly journals, etc. And, as is so common with the Copyright Act, the rights and
responsibilities defined in section 108 are primarily aimed at just those sorts of tangible 
materials: books, journals, and magazines. So the application of this section to an electronic 
archive of Internet materials is an exercise in interpretation and a bit of speculation. 


We can start with what section 108 says and what it means for physical libraries and
archives, but please remember this: as I discuss the rights that copyright owners have
and don't have, I am talking about the situation that obtains in the absence of agreement or
permission otherwise. In other words, when I say “an archive is allowed to do this but not
allowed to do that,” I mean “in the absence of permission from the copyright owner, an
archive is allowed ...” etc. If copyright owners give permission for something, then that
changes everything.


For example, authors of books have an absolute right to prevent the reproduction and
distribution of their books, but most are willing to waive that right in order to get a publisher to
reproduce and distribute their books. So although it's true that “you are not allowed to copy
and distribute an author's books,” authors routinely give permission for exactly that if they
want their material to be read by anybody.


Section 108’s rules


Libraries and archives (hereafter I'll just say “archive”) can make copies of materials and give
the copy to patrons under certain conditions, for certain, defined, purposes. Five conditions
apply to all such copying, regardless of purpose:


(1) that no more than one copy be made (at a time, for a given patron), so that the copying
never becomes “systematic” or a substitute for regular subscriptions or purchases;


(2) that the copying is of textual materials and not of music, pictures, movies, other
audio-visual works, and the like (except for TV news programs);


(3) that the archive not be making commercial advantage of the copying; 


(4) that the archive is one either open to the public generally, or at least open to researchers 
besides just those working for the institution that operates the archive; and 


(5) that any copies carry a “notice” of copyright, e.g., “© 1997 J. Doe.” 


If those five essential conditions are satisfied, then any one of a number of additional, defined
purposes may be used to justify the copying. Many of these purposes are defined in more
detail in the statute than I will summarize here, but this will convey the idea.


Archive copying for patrons that meets the above five conditions is okay if it is also either:

(a) made as a facsimile copy of an unpublished work for purposes of preservation; or

(b) made as a facsimile copy to replace another copy that was lost or stolen, and that
cannot otherwise be obtained at a “fair price;” or

(c) made of an excerpt of something larger, typically a single article from a journal or a
single chapter from a book, and the patron intends to use the copy for study or
research; or

(d) made of an entire work like a book, if made for the patron's study or research and a copy
cannot otherwise be obtained at a fair price.
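

The structure of these rules (five conditions that must all hold, plus any one of four
purposes) can be sketched as a short decision procedure. The Python below illustrates only
the summary given here, not the statute itself, which defines each element in more detail.
Every class, field, and function name is invented for the example.

```python
from dataclasses import dataclass


@dataclass
class CopyRequest:
    # Hypothetical labels for the five conditions summarized in the text.
    one_copy_only: bool
    textual_or_tv_news: bool
    no_commercial_advantage: bool
    open_to_outside_researchers: bool
    carries_copyright_notice: bool
    # Hypothetical purpose label: "preservation", "replacement",
    # "excerpt", or "whole_work".
    purpose: str
    facsimile: bool = False
    unpublished: bool = False
    lost_or_stolen: bool = False
    for_study_or_research: bool = False
    obtainable_at_fair_price: bool = True


def meets_five_conditions(req: CopyRequest) -> bool:
    """All five general conditions must hold, whatever the purpose."""
    return all([
        req.one_copy_only,
        req.textual_or_tv_news,
        req.no_commercial_advantage,
        req.open_to_outside_researchers,
        req.carries_copyright_notice,
    ])


def copying_allowed(req: CopyRequest) -> bool:
    """True if the request fits one of purposes (a) through (d) as summarized."""
    if not meets_five_conditions(req):
        return False
    if req.purpose == "preservation":   # purpose (a)
        return req.facsimile and req.unpublished
    if req.purpose == "replacement":    # purpose (b)
        return (req.facsimile and req.lost_or_stolen
                and not req.obtainable_at_fair_price)
    if req.purpose == "excerpt":        # purpose (c)
        return req.for_study_or_research
    if req.purpose == "whole_work":     # purpose (d)
        return req.for_study_or_research and not req.obtainable_at_fair_price
    return False


# A single journal article copied for a patron's research.
article = CopyRequest(True, True, True, True, True,
                      purpose="excerpt", for_study_or_research=True)
print(copying_allowed(article))
```

A request that fails any one of the five conditions is refused before the purpose is even
examined, which mirrors the order in which the rules are presented above.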


Notice that all these rules apply to an archive delivering something to a patron. Although the
statute does not say so explicitly, almost certainly an implicit requirement of section 108 is
that the library or archive originally must have obtained its own copies through lawful
means. That is, an archive that was created by stealing all the books from Barnes and Noble
would not be allowed to claim that it was authorized to make copies for patrons under this
section.


What about online archives of digital materials specifically? 


With online, electronic archives, two new issues arise. First, how does section 108 apply to 
digital materials that are only available on site? Second, how does it apply to an online 
archive that is accessible remotely, say over the Internet?


In practice, these two elements are likely to coalesce. The most likely electronic archive 
would both be of digital materials and also be remotely accessible. But it will help in thinking 
through the questions to take things one step at a time. So let us first imagine an ‘ordinary’ 
archive that wants to create digitized versions of works now in print for some of the purposes 
spelled out in section 108. May it do so? 


Physical archives 


First of all, section 108 applies primarily to textual materials; audio-visual works like
movies, as well as musical compositions and images, are simply not within the section's
purview. Therefore, copying printed images, sheet music or CDs, video-taped movies, and
the like is off limits even for reproduction in similar media, and would certainly be off limits if
converted to electronic format.


For textual materials, recall that section 108 authorizes four particular types of archive 
copying for patrons: for preservation, for replacement of lost or stolen copies, in small 
amounts as with an article from a book, and of entire works if they cannot otherwise be 
bought. The first two of these provisions, the ones concerning preservation and replacement,
are qualified by a requirement that the copies be in “facsimile form.” What does that 
mean? 


The one thing we know it means is that when “facsimile copies” are authorized, that
authorization does not include the creation of full-text searchable electronic versions of the
materials. Clearly, paper-based photocopying was envisioned and authorized. Just as
clearly, OCR scanning into a text database was also envisioned and not authorized.
Unfortunately, what is not clear is this: whether a “facsimile copy” can be an “electronic
facsimile” in the form of a scanned image of a page of text that is left as an image and not
converted into searchable text.


My own guess is that it would be allowed if patrons could not themselves convert the image 
into searchable text. Since such a conversion is, however, presently quite possible (OCR 
software, and fax software that includes OCR capability, do exactly this), my conclusion is 
that section 108 does not authorize archives to give out copies of printed textual materials in 
any digital format at all. 


If the original archive copy is itself digital, however, then a digital copy would presumably be a
“facsimile copy” and would be authorized. For example, if an archive has bought two copies
of a CD-ROM encyclopedia that contains only text (or images that merely illustrate the text),
and one of the CD-ROMs has been stolen, and an unused replacement cannot be obtained
at a fair price, then the archive would be authorized to make a copy of the remaining CD-ROM.


Online libraries and archives 


Excerpts 


Now let us shift to online libraries and archives. What changes, if anything, when an archive 
no longer exists in physical space but is a resource on the Internet?


The biggest change this makes is that an online archive is not likely to be giving out tangible 
“copies” of anything to patrons. More likely it will provide remote access to its database and 
allow online search and retrieval. Right away, that raises the question whether section 108 
applies at all, since the section is directed toward “reproducing copies” and “distributing 
copies,” neither of which seems to have much to do with “online searching and browsing.” 


There is much to be said on this very issue, but to get to the point: I think that a court faced
with the issue would conclude that allowing online access does constitute the making
and distributing of a copy of the work in question. This conclusion is far from certain, but
I think it likely.


The parts of section 108 that apply to an archive preserving its own works, or replacing works 
lost or stolen, are thus inapplicable to the situation of access by others. The question left is 
whether an archive can allow patrons access to its materials under the sections of 108 that 
deal with the copying of excerpts for research, or the copying of whole works for 
research when copies cannot otherwise be bought for a fair price. 


Remember that browsing materials online may be considered by courts to constitute the
making of a copy of the materials. If that eventually proves not to be the case, and no other
equivalent amendments are made to the Copyright Act, then what follows will no longer be of
concern to archives. So what follows is based on the assumption that “browsing” in fact
equals “copying” for copyright purposes.


The first provision, allowing the archive to copy and distribute excerpts for research, might
well be applicable to an online digital archive: a patron browsing the archive might end up
reading only an excerpt of another work, and be doing so for research purposes. If that is
true, then the archive is safe under section 108. (But don't forget that section 108 does not
apply to non-textual material at all, so an archive that allows patrons online browsing of
images, sounds, video, and so on would not be safe under section 108, though it might be
safe under the fair use provision.)


Of course, if the archive features full-text retrieval such that when a user makes a “hit” in a
search, it results in an entire work being displayed, then the “excerpt” clause would no longer
apply. In that case, the archive could not rely on section 108 to avoid an infringement claim
by the copyright owner.


Whole texts 


The second 108 provision, that for copying whole works when needed for research and
when copies cannot otherwise be bought, similarly may or may not be applicable. This
provision would offer the strongest safe harbor for an archive that preserved “out of print”
materials. That was what the clause about “not otherwise available at a fair price” was
designed for.


In the Internet context, one supposes that the analogy would be to “digital, textual materials
no longer available anywhere." But again, even this provision applies to textual materials— 
not to images, sounds, and so on. 


Copyright is complicated. Although the essential rules can be briefly stated, the exceptions
and permutations to the rules depend on context, on purpose, on the nature of the
copyrighted materials, and so on. Many of the rules were designed for the print age, such as
those that allow libraries and archives to make single copies of excerpts of works, or entire
works that are out of print. These rules only uncomfortably fit the concept of an online digital
library or archive. They require at best thoughtful interpretation to apply them to the digital
context, and at worst, educated guesswork about how a court would likely apply them.


As a result, one cannot easily or concisely summarize the copyright issues involved in
archiving. At the risk of greatly oversimplifying matters: online archives can do little collecting
and assembly of information without permission, and even less distribution without
permission. If current court interpretations of the copyright term “copy” persist, then even
allowing users to browse an online archive infringes the rights of the owners of copyright in
the archived materials.


With permission from the relevant copyright owners, of course, anything can be done, but it
appears that a substantial number of permissions will be needed under our current law
before much of anything can be done.


PRIVACY AND THE DIGITAL ARCHIVE: 
OUTLINING KEY ISSUES 


Marc Rotenberg, director 
Electronic Privacy Information Center 
Washington, DC 
www.epic.org 


DOCUMENTING THE DIGITAL AGE 
San Francisco 
February 1997 


At first glance, a privacy advocate among digital archivists is a none too popular arrival.
Privacy concerns often create obstacles to access, restrict the collection of useful information, or
raise public concerns about the merits of the underlying endeavor. A recent article on problems
with the IRS record-keeping went so far as to blame privacy protection for the failure to
construct an information management system that could process the nation’s tax returns.


It is not the goal of this paper to argue that privacy protection does not at times impose 
costs. Instead, the purpose is to provide an overview of privacy as a legal and sociological
concept, to describe several of the privacy issues that arise in the digital environment, to identify 
some of the critical issues, and also to encourage new thinking about the relationship of privacy 
and efforts to promote access to public information. 


Privacy will be an issue in the development of an Internet archive. An examination of key 
issues will help clarify underlying problems and may help resolve some of the privacy concerns. 


PAST AS PROLOGUE 


In 1987 Judge Robert Bork was nominated to be an associate justice for the United States
Supreme Court. The confirmation hearing was hotly contested. Supporters of Judge Bork
pointed to his distinguished professional career as a law school professor, solicitor general, and
appellate judge. Opponents questioned his views on such issues as civil rights, abortion, and
antitrust. Judge Bork’s scholarly articles were debated. His judicial opinions were dissected. The
nominee of a Republican President was questioned at length by the Democratic chair of the
committee.


Then in the midst of the confirmation battle, a reporter obtained a list of the video rental
records for the Bork family from a local video store. The reporter published an article in the City
Paper on the private viewing preferences of the Supreme Court nominee. Suddenly the public
debate on the nomination turned from the judge’s scholarship to his choice of John Wayne and
James Bond movies. Eventually, and for other reasons, Judge Bork was not confirmed. A few
months after the debate concluded, Congress enacted a law, the Video Privacy Protection Act, to
limit the disclosure of video rental records.


While the disclosure of Judge Bork’s video viewing habits played little role in the ultimate 
decision of the Judiciary committee, the incident and the subsequent debate about the release of 
video records provided the country a concrete example of privacy issues in the digital age. 


Imagine a judicial hearing in the year 2007. The nominee comes before the Judiciary
Committee. Should every transaction that links her name to a record on the web be available for
public scrutiny?


WHAT IS PRIVACY? 


More than a century ago, Louis Brandeis described privacy as the “right to be let alone.” It
was a sweeping phrase that is still today often cited by courts and commentators. It is also an
incomplete definition. As a legal concept, Dean Prosser said that there were in fact four distinct
privacy interests -- a right to prevent the disclosure of private facts, a right to prevent intrusion
into seclusion, a right to prevent the presentation of one’s self in a false light, and a right to 
prevent commercial appropriation of personality. Scholars have said that privacy is also a right 
of autonomy, of dignity, of freedom in personal decision-making, and of association. 


There is also the privacy right derived from the Fourth Amendment which limits the 
power of the state to conduct searches and intrude into private life. It is the legal basis for the 
requirement that officers of the state typically must obtain a warrant before they can enter your 
home or seize your papers or possessions. 

These are general descriptions of privacy. In practice, privacy is often found in law as 
specific limitations on the use of personal information. Many privacy laws today are codified in 
statute, such as the Fair Credit Reporting Act or the Video Privacy Protection Act
described above. Courts also continue to interpret the Constitutional restrictions on government 
searches and the common law concept of privacy set out by Brandeis. Privacy laws are therefore 
a combination of rules enacted by legislatures and rulings of judges. 


As a general matter, privacy laws have several common attributes:

* Privacy is fundamentally a claim of individuals

* Privacy concerns personally identifiable information

* Privacy issues arise in the collection, use, maintenance or disclosure of personally
identifiable information

* Privacy claims are given greatest weight in relations of trust

* Privacy issues are universal


The Internet raises the prospect that new privacy issues will arise. Far more information 
on individuals can be collected. Inferences about individual behavior could go much further. 


Key issue: Are currently understood concepts of privacy adequate to understand the 
issues that a digital archive would raise? 


WHY PRIVACY MATTERS 


The protection of privacy is an article of faith among Internet users. This can be seen in
the political activity of net activists on such issues as Clipper, Digital Telephony, and P-Trak.
It is reflected in the ongoing effort of organizations to develop privacy policies. It is also clear
from public polling data, including a recent poll from Lou Harris, that the general public is
concerned about privacy. Most notably, the recent GVU WWW survey also found privacy
among the top concerns of Internet users around the globe. [http://www.epic.org/privacy/]


In general, users of the Internet are more concerned about privacy than the general public.
They value anonymity, oppose the sale of personal information, and favor techniques for
anonymous payment. They are reluctant to disclose personal information if they do not know
how it will be used.


Perhaps this is not surprising. The Internet itself has promoted a wide array of 
anonymous and pseudo-anonymous activity -- the use of aliases, anonymous FTP, digital 
payment systems, anonymous surfing. The disclosure of personal information on the net is often 
a negotiation based on a person’s assessment of the benefits. Some users turn away from web 
sites that require registration. 


Is there an empirical basis for the concern about privacy? The answer is yes. At one end 
of the spectrum are the intrusions resulting from unwanted email, a minor inconvenience. At the 
other end are moments in history when governments have misused records on citizens. The 
United States used records of the census bureau to identify population tracts of Japanese- 
American families so that they could be interned during the Second World War. Nazi Germany 
made use of administrative records in municipal offices and telephone company records to 
identify Jewish families throughout Europe. Across the spectrum are public concerns about the 
loss of employment, insurance, or medical care, resulting from the improper disclosure or misuse 
of personal information. 


Key Issue: Are public concerns about privacy issues likely to increase or decrease 
in the years ahead? What would be the consequences of either outcome for an 
effort to archive the web? 


PRIVACY AS AN INSTRUMENTAL VALUE 


Privacy is often described as a human right or a political right. Privacy also plays an 
important role in information environments. Communications and information retrieval are two of 
the best examples. 


When Benjamin Franklin set out plans for the US postal system, he knew that privacy 
would be important. One of the first laws governing the operation of the US mails made it a crime 
to improperly disclose private correspondence. Two centuries later, when Congress considered 
the scope of the federal wiretap law in new digital environments, it recognized a need to extend 
privacy protection to electronic mail. Communications from one person to another often contain 
the most personal thoughts. 


Privacy also plays an important role in the realm of information retrieval, particularly for 
sensitive or controversial subjects. Web services and telephone hotlines that provide counseling 
on AIDS, gay and lesbian issues, suicide and depression routinely provide techniques for 
anonymous communication. Anonymity is necessary to encourage potential users to obtain the 
information they are seeking. 


More generally, libraries protect the privacy of borrower records. Many libraries also 
routinely destroy borrower records when library materials are returned. Such policies were 
developed to encourage users of libraries to explore unpopular topics without fear of public 
recrimination. 


A clear analogy arises online with user logs. Many web operators are reluctant to 
disclose the content of these logs to others, and some even have a policy of destroying logs. Still, 
web operators can use aggregate data to track site usage and even to obtain advertisers without 
compromising the privacy of users. 
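

As a rough illustration of the point about aggregate data, the sketch below counts requests per 
page and per day from a web server log and never stores the client address. The log layout 
(Common Log Format), the sample lines, and the function name `aggregate_hits` are assumptions 
made for illustration only; they do not describe any particular operator's practice. 

```python
import re
from collections import Counter

# Assumed input format (Common Log Format), for example:
# 192.0.2.7 - - [10/Feb/1997:09:15:02 -0800] "GET /privacy/ HTTP/1.0" 200 5120
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+$'
)


def aggregate_hits(lines):
    """Return a Counter keyed by (day, path); client hosts are never stored."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed format
        counts[(match.group("day"), match.group("path"))] += 1
    return counts


# Invented sample data; the addresses come from a documentation-only range.
SAMPLE_LOG = [
    '192.0.2.7 - - [10/Feb/1997:09:15:02 -0800] "GET /privacy/ HTTP/1.0" 200 5120',
    '192.0.2.9 - - [10/Feb/1997:09:16:40 -0800] "GET /privacy/ HTTP/1.0" 200 5120',
    '192.0.2.7 - - [11/Feb/1997:14:03:11 -0800] "GET /crypto/ HTTP/1.0" 200 2048',
    'not a log line',
]

summary = aggregate_hits(SAMPLE_LOG)
for (day, path), hits in sorted(summary.items()):
    print(day, path, hits)
```

Once the summary has been produced, the raw log could be destroyed, which matches the policy 
some operators already follow. Whether daily per-page counts are coarse enough to protect the 
users of a small site is a separate judgment that this sketch does not settle. 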


Key issue: Should private communications be routinely archived? Should acts of 
information retrieval be routinely archived? 


PUBLIC LIFE AND PRIVATE LIFE 


Privacy can also be understood as the interaction of two distinct social environments -- 
the public realm and the private realm. The German philosopher Jürgen Habermas described a 
dynamic relationship between the private sphere and a public sphere. Social life and political 
activity occurred in the public sphere. Private life was distinct and occurred in a separate sphere. 
Individuals moved between the two spheres at different times for different reasons. In a similar 
light, Erving Goffman, the Berkeley sociologist, observed that in moving from a private realm to a 
public realm, individuals formed bonds of trust. These bonds of trust were based on the 
disclosure of aspects of personality that might occur in one relationship but not another. A 
person might share a secret with a colleague but not with a stranger. 


The web provides an excellent example of the operation of a public sphere and a private 
sphere. Web pages are a part of the public sphere and reflect a wide range of social, commercial, 
and political activity. The private sphere includes email and intranets, communications that are 
personal and organizational relations that are closed. Arguably, the crossing point between the 
private sphere and the public sphere can be found in web usage logs. It could well be the case that 
the freedom to move between a private sphere and a public sphere on the web will be determined 
by the privacy of these records. 


Key issue: Does the creation of an Internet archive threaten the preservation of a 
private sphere on the Internet? 


PRIVATE LIFE AND SURVEILLANCE 


Intrusions into private life can be justified in some circumstances. A doctor conducts a 
physical examination of a patient to ensure an accurate diagnosis. An employer inquires into 
previous employment of a potential employee to determine if the person is suitable for a 
particular job. A bank collects credit information on a person requesting a loan to decide if the 
person will be able to repay a loan. Law enforcement officials search a person’s home or place of 
business if there is probable cause to believe that a crime has been committed. 


But routine intrusions into private life are more problematic. Few people would consent 
to ongoing video surveillance, even in a public place, or the routine disclosure of personal 
financial records to others. 


Consider how Justice Brandeis viewed the emergence of the first collection of electronic 
information in the law enforcement context, what I like to call the first cyberspace opinion. The 
question before the Supreme Court in 1928 was whether the practice of intercepting telephone 
communications should be subject to the probable cause requirements of the Fourth Amendment. 
The government argued that since there was no physical intrusion into the defendant’s home, 
there was no physical search. 


Brandeis was not persuaded. He compared the interception of a telephone communication 
with the interception of a letter sent by the US postal service. Brandeis noted that this type of 
search was not limited in space or time. In a physical search, for example, the officer had a 
warrant for a specific bit of evidence. The search occurred at a fixed point in time, and there was 
also notice to the target. Wiretapping was ongoing and could capture the communications of 
innocent parties. 


Modern law on wiretapping generally reflects Brandeis’s concern by limiting the duration, 
purpose, and scope of electronic surveillance. 


Looking at the problem of surveillance from a different perspective, Jeremy Bentham, the 
18th century utilitarian philosopher, designed plans for the perfect prison, which he called 
the panopticon. In such a prison, it would be possible to constantly monitor a prisoner. Bentham 
speculated that in such an environment simply the belief that one was being watched would be 
enough to coerce behavior. 


Surveillance, as a routine intrusion into private life, is one of the common themes of many 
dystopias such as 1984. There is rarely a private sphere, or it is enjoyed by only a small elite. 


Key Issue: Is it possible to develop archiving practices that do not involve ongoing 
surveillance? 


PRIVATE LIFE IS PART OF HISTORY 


Having said that privacy is a fundamental concern on the Internet, it would nonetheless be 
foolish not to recognize that private life is also very much a part of the historical record, and that 
one could not properly understand a culture or a period in time without some insight into private 
activities. A few examples of popular collections: 


* Walker Evans’s photographs of sharecroppers during the Depression 


* Edward Steichen’s collection The Family of Man in the 1950s, which helped the world 
see all the beauty and diversity of people, places, cultures, and traditions 


* Robert Coles’s interviews with young children 


Some mechanisms should be developed to capture “snapshots” of private life on the 
Internet -- particular email messages, even particular web logs. 


Key issue: How should information about private life be collected and how should it be 
presented in a web archive? 


THE PROBLEM OF BALANCE 


When faced with competing policy interests, it is often tempting to invoke a call for 
balance. Apart from its rhetorical value, “balance” grants both claims some legitimacy and avoids 
the hard work of making difficult choices. 


But balance is often not the best way to think about privacy issues. This is not because 
there are no competing interests. Rather, it is the case that policy, like the technology it addresses, 
is highly dynamic. The range of policy choices that exist next year to reconcile competing 
privacy and public access claims may be very different from the choices available today. 


Let me borrow a device from the economists to help make this point. Economists tell us 
that given two things we value, carrots and cabbage for example, we can often plot our trade off 
of one good for the other on an indifference curve. In the policy realm, we might also make 
choices between two policy goals -- privacy protection and public access. Looking at a fixed 
range of policy options (PO), we might choose PO* which gives us a lot of public access at the 
expense of privacy protection. Or we might choose PO' which provides a high level of privacy 
protection but at a sacrifice in public access. 


(Figure 1) 


Some problems in information policy involve making choices between PO* and PO', which 
necessarily pit privacy interests against interests in public access. But in a surprising number of 
cases, better policies are found when it is possible to move the indifference curve to a position 
that increases both privacy protection and public access. 
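

One hedged way to make “moving the curve” concrete is to treat each policy option as a pair of 
scores, one for public access and one for privacy protection. The scores, the option names, and 
the helper `dominates` below are invented purely for illustration; no such values are assigned 
anywhere in this paper. Along a fixed curve neither option is better on both counts, whereas an 
option on a curve that has moved outward can improve on both. 

```python
from typing import NamedTuple


class PolicyOption(NamedTuple):
    name: str
    public_access: float       # hypothetical score, higher is better
    privacy_protection: float  # hypothetical score, higher is better


def dominates(a: PolicyOption, b: PolicyOption) -> bool:
    """True if a is at least as good as b on both goals and strictly better on one."""
    at_least_as_good = (a.public_access >= b.public_access
                        and a.privacy_protection >= b.privacy_protection)
    strictly_better = (a.public_access > b.public_access
                       or a.privacy_protection > b.privacy_protection)
    return at_least_as_good and strictly_better


# Two options on the same curve: a pure trade-off (invented numbers).
access_heavy = PolicyOption("access-heavy", public_access=8.0, privacy_protection=2.0)
privacy_heavy = PolicyOption("privacy-heavy", public_access=2.0, privacy_protection=8.0)

# An option on a curve that has moved outward (invented numbers).
outward = PolicyOption("outward", public_access=9.0, privacy_protection=9.0)

print(dominates(access_heavy, privacy_heavy))  # False: each gives something up
print(dominates(outward, access_heavy))        # True: better on both goals
print(dominates(outward, privacy_heavy))       # True: better on both goals
```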


(Figure 2) 


What are examples of “moving the curve” to a better range of policy options? 


* In 1974, the United States Congress passed the Privacy Act and expanded the scope of 
the Freedom of Information Act. In this way, the privacy of personal records held by the 
federal government was protected, while public access to public records held by the 
government was enhanced. 


* In a recent case, the Supreme Court ruled that the publication of anonymous pamphlets 
was protected by the First Amendment. In so holding, the Court both expanded the 
privacy rights of authors to hide their identity and promoted the free flow of information 
by limiting the ability of state governments to restrict the publication of information. 


* In libraries, individuals are typically free to obtain a wide range of materials without any 
recording of their interests. These practices encourage access and protect privacy. 


Conversely, it is also possible to imagine a world where both privacy protection and 
public access are limited. Prisons, totalitarian governments -- and notably the world of Brother 
Francis -- are societies where public access to information is limited, but so too is privacy 
protection. There is no private sphere in these worlds. And the absence of a secure private sphere 
is reflected in the absence of a robust public sphere. 


Key issue: How do we construct a policy for an Internet archive that increases both 
public access and privacy protection? 


INFORMATION POLICY AND OPEN SOCIETIES 


Open societies and democratic societies tend to formalize the protection of the public sphere 
as well as the private sphere, moving the curve outward to the open-society policy options shown 
in Figure 2. Such societies establish procedures that assure the preservation of government 
records, the privacy of individual records, and access to public information. A quick survey of 
such laws in the United States includes: 


* Privacy Act (private records held by government agencies) 
* Freedom of Information Act (public access to government records) 
* Records Preservation Acts 
* Federal Advisory Committee Act (open meetings) 
* Depository Library Program 


Key issues: How would such laws apply to an Internet archive? Will an Internet archive 
reflect a similar commitment to protecting the public and private spheres? 


CHALLENGE OF ANONYMITY 


Many of the current efforts to protect privacy on the Internet are based on promoting 
anonymity and pseudo-anonymity. A new German ISP law requires the availability of 
anonymous payment mechanisms for commercial providers offering goods and services over the 
Internet. In response to public concern that cookies could be used to transfer personally 
identifiable information, companies such as PGP Inc. have developed a program called “Cookie 
Cutter”. Other efforts include the development of digital cash and the destruction of usage logs. 


Key issue: Will archivists respect the efforts to promote systems and techniques for 
anonymity or will there be pressure to personally identify information or preserve 
information that might otherwise be destroyed? 


One area where the goals of archivists may face trouble with traditional privacy rules 
concerns the obligation of organizations that collect personal information to protect the privacy 
interests of data subjects. These rules are commonly referred to as a Code of Fair Information 
Practices and could be simple codes of conduct or formal rules of law. 


A Code of Fair Information Practices typically restricts the ability of an information 
collector to disclose personal data to others. A code also grants data subjects a right to 
inspect and correct personal information. A code may even include an obligation to destroy 
information about individuals once a certain period of time has passed. 
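

The destruction obligation can be pictured with a small sketch. The record layout, the one-year 
retention period, and the function name `purge_expired` are all assumptions chosen for 
illustration; an actual code would set its own period and its own definition of a record. 

```python
from datetime import datetime, timedelta
from typing import Dict, List


def purge_expired(records: List[Dict], now: datetime, retention: timedelta) -> List[Dict]:
    """Keep only records collected within the retention period ending at `now`."""
    cutoff = now - retention
    return [r for r in records if r["collected_at"] >= cutoff]


# Invented sample records about two hypothetical data subjects.
records = [
    {"subject": "user-a", "collected_at": datetime(1996, 1, 15)},
    {"subject": "user-b", "collected_at": datetime(1997, 1, 20)},
]

kept = purge_expired(records, now=datetime(1997, 2, 10), retention=timedelta(days=365))
print([r["subject"] for r in kept])  # ['user-b']
```

A rule of this kind sits in direct tension with an archive's purpose, which is why the question 
that follows matters. 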


Key Issue: Should an Internet archive be subject to any form of a Code of Fair 
Information Practices? 


EPIC AS EXAMPLE 


The Electronic Privacy Information Center follows an information policy that 
protects privacy and promotes public access. We make every effort to protect the 
privacy of our users -- we destroy our logs, we do not disclose our records to others, we 
are trying to support systems for anonymous payment, and we make a wide range of 
privacy enhancing tools available at our web site. 


But EPIC is also very much interested in preserving and publishing public 
information of interest to the general public. One of our critical goals is to make available 
policy documents obtained from the government regarding the development of 
cryptography policy. To accomplish this task we have pursued Freedom of Information 
Act litigation, scanned images of documents, and archived records at our web site. 


Some of these documents could be crucial for understanding critical policy choices 
made in the early stages of the development of the Internet. Consider for example these 


items: 


* A memo from Brent Scowcroft to George Bush describing Digital Telephony as a 
“beachhead” for Clipper 


* A memo from the FBI to the National Security Council indicating that Clipper 
will only work if made mandatory 


Key issue: Could policies that protect privacy and promote public access be adopted by 
an Internet archive? 


THE CASE FOR OPENNESS 


The creation of an Internet archive raises a series of privacy issues, but it is best that 
these issues are debated publicly and that privacy concerns are addressed. What if a large private 
corporation or a large government agency undertook a similar effort to collect everything on the 
net, but without any opportunity for public discussion or any public awareness of the collection 
activity? 


In beginning a discussion on the privacy implications of an Internet archive, an important 
step has been taken toward the development of policies consistent with open societies. 
Successful resolution of these issues could help avoid the establishment of archives that reflect 
the values of closed societies -- little public access, little privacy protection. 


Key issue: Assuming that it is possible to develop privacy-appropriate standards for an 
Internet archive, should steps be taken against other entities that attempt a similar effort 
without appropriate safeguards? 


GOALS 


From the privacy perspective, the central goal for documenting the digital age should be to 
build a public archive that respects private life. This means recognizing that there are aspects of 
private life that should not be recorded. It also means ensuring that the content of the archive 
is made widely available and not held by a small elite. 


Such an archive would not routinely record private activity on the web, but would do so 
selectively. Boundaries on disclosure would be based on promoting techniques to preserve 
anonymity, limitations on collection and disclosure, and recognizing that private spheres exist on 
the net and should be preserved. 


How do we know where private life begins? One answer may be provided by other 
experiences with past technologies for recording our history. It could be said that privacy begins 
when the tape recorder is turned off, when the camera is put down, when the pen rests by the 
paper. That is where private life begins. If we record everything on the web, that could also be 
where privacy ends. 


PRIVACY AND THE DIGITAL ARCHIVE: OUTLINING KEY ISSUES 


Figure 1. Privacy Protection and Public Access as Trade-Off. Axes: Public Access and Privacy 
Protection. Legend: PO* = a public access policy option; PO' = a privacy protection policy 
option. 


Figure 2. Privacy Protection and Public Access as Search for Optimal Policy. Axes: Public 
Access and Privacy Protection. Legend: three sets of policy options -- Initial, Open Society, 
and Closed Society. 


Documenting the Digital Age 1997 Marc Rotenberg 


How Do We Make Electronic Archives Usable and Accessible? 


Prepared for Documenting the Digital Age, San Francisco, February 10-12, 1997 


Margaret Hedstrom, School of Information, University of Michigan 


Let me begin my remarks with a description of the access process in many 
electronic archives today. 


A junior in college is working on a research paper about Gulf War 
Syndrome among women. She has read newspaper and journal articles, 
two books on the topic, and several government reports. She would like 
to find some quantitative data so that she can compare women and men. 
She also read that there was an on-line discussion group of women who 
served in the Gulf War and she would like to find its archive to analyze 
how they have reacted to Gulf War Syndrome. 


She finds several on-line catalogs from repositories of electronic records 
and identifies potentially useful sources. She sends an e-mail request to 
one repository where the staff photocopy the finding aid and user 
documentation for each data set of interest and mail it to her. After 
reviewing the documentation, she selects two files of interest and faxes 
an order for them. One file is available on a floppy disk; the other is 
available only on magnetic tape. The archives has a three day backlog of 
copy requests. When her request reaches the front of the queue, 
archives staff copy the requested files and mail them to the student. 
Three weeks elapse between the student's initial interest and receipt of 
the data. The student has to locate a computing facility on campus that 
maintains a tape drive. She has to reformat the data and arrange to 
transfer it through the campus network to her personal workstation. 
She cannot find the address or any information about the e-mail 
discussion group so she decides to abandon that part of her analysis. 


Meanwhile, the archives staff has compiled data on the use of its 
collection. The use figures are appalling. Although they receive several 
hundred e-mail questions each month asking about specific data sets, 
most requesters lose interest when they learn that they have to purchase 
data on diskettes or magnetic tape and wait for it to be copied and 
shipped. Administrators are asking: why are we keeping all of this stuff 
when no one uses it? 


Although this vignette does not describe the only way that electronic 
archives are made accessible today, it illustrates all too common problems with 
access to archival materials in electronic form. Locating potentially useful 
sources can be frustrating and undependable because access tools are not 
comprehensive or integrated into a uniform access system. Retrieving materials 
from off-line storage and delivering them in antiquated formats is time 
consuming and labor intensive for both the repository and the researcher. 
Little used archives are difficult to justify, especially when ongoing investments 
in physical maintenance of the collection are necessary to avoid physical 
deterioration and technological obsolescence. 


More importantly, this vignette illustrates the relationships between 
accessibility, convenience, levels of use, and the costs of delivering services. 
While there are no comprehensive analyses of the use patterns among 
established electronic archives, anecdotal evidence suggests that the more 
readily accessible the materials, the more likely they are to be used. Moreover, 
making electronic records accessible on-line can be more cost-effective for the 
repository than producing custom made copies on demand. We should use 
several criteria to devise and select strategies that will make electronic archives 
accessible and usable. First, we should identify approaches to access that 
best satisfy users’ needs. Second, we should consider how to provide access 
to electronic archives at a reasonable cost and in a more economic manner than 
is common in archives today. Third, we should make certain that providing 
access to electronic archives will improve the process and results of research. 
Accessibility is imperative for electronic archives, not only to meet rising user 
demands and expectations, but to develop an economically sustainable model 
of archival services. 


My conception of “electronic archives” is a highly distributed one. In 
thinking about accessibility and usability, I envision a world where a wide 
variety of institutions and individuals will take responsibility for preserving 
digital information. Some electronic archives will be maintained by special 
repositories dedicated exclusively to preserving and providing access to digital 
information; others will be extensions of traditional archives with hybrid 
collections of paper-based and electronic records. Many valuable electronic 
sources, however, will be made available directly by their original creator or 
producer because it is impractical to transfer custody to a special repository or 
because the institution or individual who created the records has ongoing need 
for them. We need to develop strategies and methods for accessibility and 
usability that can span a variety of custodial arrangements. 


The use of computer and network technologies to disseminate 
descriptive information about archival records and to provide remote access to 
their contents shows promise of vastly improving access to archival records. 
National and international databases, such as the Research Libraries 
Information Network (RLIN) maintained by Research Libraries Group (RLG), 
contain catalog records that describe more than half a million archival 
collections in repositories around the world. The archival community is in the 
final stages of developing and promulgating a standard for Encoded Archival 
Description which uses SGML to produce browsable and searchable on-line 
finding aids for special collections. These are important building blocks in the 
development of comprehensive and integrated access systems for archival 
materials, although much remains to be done to realize their full potential. Only 
a small percentage of all archival records are described in network-accessible 
databases, and most descriptions only provide access at the very general level 
of the collection or records series. Only a minuscule portion of current archival 
records have machine-readable finding aids, indexes, and other access tools 
that help researchers locate specific documents or items, and only a tiny 
fraction of current holdings have been converted to digital formats for network 
delivery. 


One of the first steps that we can take to make electronic archives 
accessible is to integrate descriptive information about them into existing 
access systems for archives, special collections, and other primary source 
materials. This is important for several reasons. Users should be able to locate 
electronic sources through local, national, and international access systems 
without having to search separate catalogs or databases which are segregated 
by format or form of material. As electronic archives become comprised of 
multifarious and highly heterogeneous types of information, segregation by 
format (electronic versus non-electronic) will present obstacles to accessibility. 
It made sense to establish special access systems for machine-readable data 
archives when the term “machine-readable” was largely synonymous with 
numeric and statistical data files. Now, electronic archives can contain any 
form of material -- textual documents, photographs, sound, moving images, 
maps, drawings, or data. We still live in a hybrid environment where many 
processes are only documented adequately through a combination of electronic 
and paper sources. Maintaining linkages between different formats of materials 
will become increasingly burdensome if we do not find ways to develop 
integrated access systems. Multi-media products defy categorization by format, 
and I would urge us to avoid the temptation to establish yet another type of 
archive -- the multi-media archive. 


We should take the notion of integrating electronic archives into access 
systems for traditional archival materials one step further by investigating ways 
to integrate or link access systems for archives (paper and electronic) with the 
access systems for cultural resources that reside in traditional and digital 
libraries, museums, and other cultural institutions. Access systems for 
electronic archives should allow users to navigate through layers of increasingly 
detailed description that will help them identify, locate, and evaluate primary 
source material. Making electronic resources available on-line will not obviate 
the need for cataloging and descriptive information about each resource, and it 
may, in fact, make such information even more critical. Access systems should 
provide core descriptive elements that identify the origin or creator of the 
source, its title, inclusive dates, extent, and some minimal level of subject 
access. The Dublin Core metadata elements, developed through a workshop 
sponsored by OCLC and the National Center for Supercomputing 
Applications, offer one model for such a high-level directory that could 
support discovery of network resources. This level of description should lead 
to a more detailed finding aid, appropriate for archival materials, which 
describes the scope and content of the resource, provides evaluation criteria, 
and explains the origin and provenance of the source. For some resources, 
detailed indexes linked to the finding aid would provide access to files, 
documents, or specific items. For some types of material, technical 
documentation should be provided detailing such attributes as the file 
structure, coding or representation schemes, hardware and software 
requirements, or other features of the source. Users could navigate through 
these layers of description to identify and select materials relevant to their 
problem or research question. 
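
The layers described above could be modeled roughly as follows. This is only a sketch: the 
class and field names (`CoreDescription`, `FindingAid`, `TechnicalDocumentation`, `describe`) 
are hypothetical, the sample values are invented, and the core fields simply follow the 
elements named in the text rather than reproducing the Dublin Core element set itself. 

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TechnicalDocumentation:
    file_structure: str
    coding_scheme: str
    hardware_software_requirements: str


@dataclass
class FindingAid:
    scope_and_content: str
    evaluation_criteria: str
    origin_and_provenance: str
    item_index: List[str] = field(default_factory=list)
    technical_documentation: Optional[TechnicalDocumentation] = None


@dataclass
class CoreDescription:
    creator: str
    title: str
    inclusive_dates: str
    extent: str
    subjects: List[str]
    finding_aid: Optional[FindingAid] = None


def describe(record: CoreDescription) -> List[str]:
    """Walk from the most general layer to the most detailed one that is present."""
    layers = [f"{record.title} ({record.inclusive_dates}), {record.creator}, {record.extent}"]
    aid = record.finding_aid
    if aid is not None:
        layers.append(aid.scope_and_content)
        layers.extend(aid.item_index)
        if aid.technical_documentation is not None:
            layers.append(aid.technical_documentation.file_structure)
    return layers


# Invented sample record, for illustration only.
record = CoreDescription(
    creator="Example Agency (hypothetical)",
    title="Survey data files",
    inclusive_dates="1991-1995",
    extent="2 data files",
    subjects=["health", "veterans"],
    finding_aid=FindingAid(
        scope_and_content="Responses to a hypothetical health survey.",
        evaluation_criteria="Sampling method and response rate are documented.",
        origin_and_provenance="Transferred by the creating office.",
        item_index=["file-1: respondents", "file-2: codebook"],
        technical_documentation=TechnicalDocumentation(
            file_structure="Fixed-width text, one respondent per line.",
            coding_scheme="Numeric codes defined in file-2.",
            hardware_software_requirements="Any system that reads plain text.",
        ),
    ),
)

for line in describe(record):
    print(line)
```

Because every layer below the core description is optional, the same structure could describe 
a source held by a dedicated repository with a full finding aid and one offered directly by its 
creator with only the core elements. 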

Before we go much farther with designing access systems, we need a 
systematic and comprehensive analysis of users and their requirements. Any 
effort to make electronic archives accessible and usable will be hindered by the 
lack of knowledge about current and potential users of archives. Even without 
the introduction of on-line access systems and networked resources, the user 
community for archival materials has become increasingly diverse in recent 
decades. Once the sanctum of historians and other scholars, archives have 
become known by and appealing to a larger, more popular, and more diverse 
user population. Alex Haley's book Roots is credited with fueling a nascent 
movement of avocational researchers seeking records for genealogy and family 
history. The use of compelling primary source materials as illustrations in 
books or as the basis for documentary films, such as Ken Burns’ 
documentaries on the Civil War and Baseball, introduces primary source 
materials to large and popular audiences. Archival materials have played an 
increasingly central role in uncovering evidence from the past that supports 
legal claims against violations of civil rights or implied contracts, reveals 
patterns of negligence, or establishes linkages between exposure to certain 
agents and medical consequences which can have life threatening effects. 
Teachers have begun to work with archivists to select archival materials for use 
in the curriculum because students find primary sources engaging and they 
provide excellent tools for learning how to evaluate and interpret various forms 
of evidence. 


These observations speak in part to the broad perspective that we must 
maintain when considering what to keep in electronic archives, but I am raising 
these points to address the question of accessibility. Although the general 
trends I described above are reinforced by anecdotal evidence of changing user 
needs and by scattered statistics from reading rooms, the archival community 
does not have a good understanding of its current or potential user community, 
their interests, their facility for using and understanding primary source material, 
or their needs. When we add to this the potential for making electronic 
archives accessible to a much larger user community, with different needs and 
abilities, sometimes without human mediation, we add another layer of 
complexity to the question of accessibility. I would argue strongly for 
systematic studies of why users seek archival materials, what mechanisms they 
use to discover sources, how satisfied they are with the materials they find, 
how much they are willing to invest in finding and gaining access to archival 
materials, which delivery mechanisms they prefer, and what problems they 
encounter in using and interpreting the sources they find. Such research 
would be useful only if it were extended beyond the current user population to 
identify potential and future users whose needs may differ considerably from 
those of the current user population. Without such data, we will not be able 
to design access systems that address user needs effectively. 


Building electronic archives that are accessible to a wide variety of users 
in the formats they most prefer is only half the battle. We will have 
accomplished little if we cannot deliver sources that are usable by requesters. 
There are numerous options and tools for delivering electronic documents to 
users with Internet access, but many of these approaches are not robust 
enough to deliver reliable, authentic, and usable archival records. The 
characteristics of archival records as documentary evidence of human activity 
demand specific strategies and management methods that will protect their 
integrity while enhancing access to their contents. 


One of the primary concerns is that most archival records have to be 
presented in a larger context because they rarely can stand alone as unique, 
bounded objects that are self-explanatory. Contextual information about the
creator, purpose, events surrounding the creation of a record, and its chain of 
custody is essential for determining the reliability of electronic documents and 
for interpreting their contents. The principle of provenance remains at the core 
of strategies for managing archives in the network environment. Respect for the 
principle of provenance means that archival records must not be separated
conceptually from the broader context of their origin, creation, and use. 
Contextual information, which is critical for interpreting the contents of archival 
records in any format, includes knowledge of the relationships among 
documents, the circumstances that gave rise to their creation, their intent or 
purpose, their receipt and use, and the chain of custody from the originator to 
the present custodian.


Contextual information can be provided through a variety of means. 
Specific metadata that explicitly describes the context from which archival 
sources were derived can be attached or linked to each record or document. 
Structural elements can be imbedded in the documents to provide visual and
other semiotic clues about their origins. Adhering to the principle of
provenance often demands examination of legal mandates or bureaucratic 
regulations which require the creation of certain types of records, biographical 
research about individuals, and knowledge of the administrative history, 
organizational structure, and business processes of the entities that generate 
records. At least at the outset, the people who are building electronic archives 
will have to make a concerted effort to capture or supply sufficient contextual 
information about the contents of digital archives because much of the digital 
information being generated today is not self-documenting. Our culture is 
inventing new forms of documentary evidence that are technologically complex, 
yet socially and culturally primitive. Electronic mail systems, for example, deliver 
a variety of messages in an undifferentiated structure, ranging from formal 
transactions, to deeply personal communications between friends, to 
anonymous postings on bulletin boards. Document conventions have not 
evolved sufficiently to support effective management of electronic records or 
consistent interpretation of their contents. Documentary forms are becoming 
more sophisticated and refined, however, with increasing possibilities for 
creating self-referential documents, and archivists are beginning to understand 
the core descriptive elements that must accompany content to make it
meaningful.


Recent research on electronic records management has identified 
metadata models and elements that should accompany digital objects to
support their authentication and long-term management. Although there is no
single model or set of metadata specifications, several initiatives have proposed
ways to attach metadata to electronic documents or files in order to address
problems of authentication, interpretation, and archiving. One such model,
developed in a research project at the University of Pittsburgh, divides 
descriptive information about electronic records into six categories: 


1) registration metadata that uniquely identifies each electronic object;
2) terms and conditions metadata that contains information about access
restrictions or other conditions of use;
3) structural metadata with information about the file or document
structure;
4) contextual metadata with information about the creation and
provenance of the document;
5) content metadata describing the logical and physical aspects of the
content; and
6) metadata on use of each record.8


Presently, archivists would have to extract, compile, and structure this
metadata because few systems have been designed to supply and organize
metadata in a consistent, standardized manner. If models such as this become
widely adopted, however, one can envision a time when more electronic records
will be self-documenting.
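
As a rough illustration of what a self-documenting record might carry, the six categories above can be sketched as a simple data structure. This is only a sketch under stated assumptions: the class name, field names, and sample values are hypothetical and are not drawn from the Pittsburgh specification itself.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class RecordMetadata:
    """Hypothetical container for the six metadata categories listed above."""
    registration: dict          # 1) uniquely identifies the electronic object
    terms_and_conditions: dict  # 2) access restrictions and conditions of use
    structural: dict            # 3) file or document structure
    contextual: dict            # 4) creation and provenance
    content: dict               # 5) logical and physical aspects of the content
    use_history: list = field(default_factory=list)  # 6) use of the record

    def missing_categories(self) -> list:
        """Return the names of descriptive categories that are still empty."""
        described = asdict(self)
        described.pop("use_history")  # a new record legitimately has no use yet
        return [name for name, value in described.items() if not value]

    def log_use(self, user: str, action: str) -> None:
        """Append one entry to the record's use history."""
        self.use_history.append({"user": user, "action": action})


# Illustrative values only.
memo = RecordMetadata(
    registration={"id": "rec-0001"},
    terms_and_conditions={"access": "unrestricted"},
    structural={"format": "plain text"},
    contextual={},  # creator and provenance not yet supplied
    content={"title": "Budget memo"},
)
print(memo.missing_categories())  # ['contextual']
memo.log_use("reader-17", "viewed")
```

A check like `missing_categories` hints at the work an archivist must currently do by hand: finding which contextual details a system failed to capture when the record was created.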


We should also develop the means to distribute the software needed to 
open, view, and analyze electronic materials with the records themselves. The
problem of software dependency and software obsolescence is one of the most
intractable obstacles facing electronic archives. Few archives have the technical 
resources to maintain obsolete versions of software that might be required to 
open, view, and manipulate archival records which were created using software
that has been updated or replaced. The notion of a distributed electronic 
archive offers a partial solution to this problem. It should be technically feasible 
for a few sites to maintain older versions of software or emulators of older 
versions that run on the current generation of hardware and operating systems. 
Users needing access to older software in order to use electronic records in
obsolete formats would be able to download and install the software on their 
own workstations or submit requests to a server that supports the software. 
Such an approach would serve a dual purpose. It would provide users with 
access to software tools that are difficult to locate and install and it would 
provide a means to preserve software as an important intellectual and cultural 
resource in its own right. This approach will not eliminate the need for 
periodic migration of electronic records because eventually the incompatibilities 
between older software and current hardware and operating systems will 
become insurmountable. Nevertheless, this strategy could reduce the frequency 
of migrations, provide access to records with the same look and feel as their 
original format, and curb maintenance and migration costs. 


There is a great deal that archivists and designers can do to build 
electronic archives that are accessible and usable, but we should be cautious
about placing all of the functionality into the archival system itself. Adequate
descriptive information and techniques like time/date stamps and encryption
can be employed to prevent alteration of records. But we will need to launch a
parallel effort to teach the users of electronic archives how to be discriminating 
and skeptical consumers of digital information. Learning how to evaluate and 
interpret evidence has always been an implicit goal of our educational system. 
While the specific skills needed to evaluate digital documents may differ from 
those used for older forms of records, they are no less essential. Here we can 
learn from the experience of European scholars and archivists who, upon 
discovering that many medieval documents were fakes and forgeries, developed 
the discipline of diplomatics in the seventeenth century to analyze and 
authenticate documents.9 Some archivists today are applying the principles of
diplomatics to digital information with the intent of building into modern
information systems the capability of producing reliable and authentic records.
But we must also think about ways to teach users the principles of a new
digital diplomatics so that they can apply these principles themselves to make 
educated judgments about the accuracy, reliability, and authenticity of the 
documents that they retrieve from electronic archives. We need to teach the
next generation of scholars, as well as the general public, how to approach
digital evidence with a questioning mind about how it was generated, why it 
was preserved, and how it might be interpreted. Until we feel as comfortable 
with electronic evidence as we do with traditional forms of documentation,
archivists will have a responsibility to help users evaluate, understand, and
interpret new documentary forms. 
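

One way to picture the time/date stamps mentioned above is a small fixity check. The sketch below is an assumption, not a method named in this paper: it pairs a timestamp with a SHA-256 digest, so that a later change to a record can be detected rather than physically prevented. The function names `seal` and `is_unaltered` are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def seal(content: bytes) -> dict:
    """Record a digest of the content together with the time it was sealed."""
    return {
        "sha256": hashlib.sha256(content).hexdigest(),
        "sealed_at": datetime.now(timezone.utc).isoformat(),
    }


def is_unaltered(content: bytes, seal_record: dict) -> bool:
    """Return True if the content still matches the digest taken at sealing."""
    return hashlib.sha256(content).hexdigest() == seal_record["sha256"]


original = b"Minutes of the board meeting, 12 February 1997."
stamp = seal(original)

print(is_unaltered(original, stamp))                 # True
print(is_unaltered(original + b" (edited)", stamp))  # False
```

A check like this tells a user only that the bytes have not changed since sealing. Judging whether the record was reliable in the first place remains the kind of evaluation that a digital diplomatics would have to teach.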


The actions taken by individuals and organizations to save and care for 
their own archives will play a vital role in enriching the archival record. We
should pursue strategies that change the norms of individual record keeping,
allow people to build their own digital archives, increase awareness of the
practical and cultural value of documentary evidence, and develop simple
tools that help individuals and organizations save and protect their records.
Current software tools that "save" or "archive" documents, whether designed
for individuals using microcomputers or for complex networks, fall short of what 
is needed to capture and preserve meaning-rich records. While personal and 
organizational collections of digital materials might be turned over to specialized 
archives at some point, the ability of archival repositories to provide meaningful 
access to such collections will depend to a large extent on the measures that 
the original creators take to organize, describe, and care for their records. To
the extent possible, recordkeeping standards and practices should be 
integrated into the processes of records creation and maintenance, support the 
access and retrieval requirements of the records creator, and protect the 
integrity and authenticity of records.10


No discussion of accessibility and usability would be complete without 
raising the issues of affordability and access restrictions. Increasing concerns 
about personal privacy, efforts to gain or retain control over intellectual
property, and the growth of fee-based access services all work against widely
accessible electronic archives. Archivists and researchers will not be able to 
shape individual or societal norms about privacy and access to personal or 
confidential information. Yet there are some practical measures that the 
developers of digital archives can take to mitigate privacy concerns and support 
legitimate access to private or confidential information. Any electronic archives 
should develop comprehensive policies that define the terms and conditions for 
release of records, the degree of access restrictions acceptable to the archives, 
and the requirements for use of restricted sources. Prior to acquiring or 
gaining control over materials, the archives should negotiate with each donor a 
clear statement of access restrictions. Some archives will need to develop 
redaction capabilities that mask individual identities or permit the selected 
release of portions of files or documents. In developing policies for access, 
electronic archives can learn much from the experience of repositories of 
traditional formats of materials. Respectable archives have formal access 
policies and the archival profession as a whole embraces the principle that 
restrictions on access should be kept to a minimum. If access restrictions are
necessary to comply with privacy or other requirements or to secure
donations of materials, they should apply equitably to all users or to
specific categories of users.11
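

The redaction capability described above can be illustrated with a very small sketch that produces a public use version of a data file by dropping personal identifiers. It is a sketch under stated assumptions: the function name, field names, and sample cases are invented for illustration, and real redaction would also have to consider indirect identifiers.

```python
def public_use_version(cases, identifier_fields):
    """Return copies of the cases with the named identifier fields removed."""
    blocked = set(identifier_fields)
    return [
        {key: value for key, value in case.items() if key not in blocked}
        for case in cases
    ]


# Invented sample data; no real individuals are described.
registry = [
    {"name": "A. Example", "case_number": "0001", "gender": "F", "year_treated": 1993},
    {"name": "B. Example", "case_number": "0002", "gender": "M", "year_treated": 1994},
]

released = public_use_version(registry, ["name", "case_number"])
print(released[0])  # {'gender': 'F', 'year_treated': 1993}
```

The restricted file stays intact in the archive, while the derived version can be released under the access policy negotiated with the donor.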


The law and policies around intellectual property and user fees will also 
be decided by forces outside the archival community. Nevertheless, developers 
of electronic archives must be cognizant of the impact of intellectual property 
issues on both the usability of the archive and the complexity of its 
administration. From the user's perspective, electronic archives should 
encourage, if not require, donors to place their materials in the public domain.
If this is not possible, the archives should negotiate for liberal fair use 
provisions. Regardless of the outcome of such negotiations, it will be essential 
for the archives to carefully document the copyright status of its holdings and 
the provisions for requesting permission to use materials that are subject to 
copyright. Likewise, electronic archives should resist the temptation to impose 
user fees for personal, scholarly, or educational uses of the archives. In 
building electronic archives, we are creating a cultural resource and serving a 
larger public good. While charges for the commercial use of the archives might 
provide one source of revenue, we should not subordinate the larger social and 
cultural objectives of electronic archives to their commercial viability. 


Let me close with an alternative vision of using electronic archives: 


The college junior with her research topic on women and the Gulf War 
Syndrome in mind searches a high-level directory using natural language
to describe her research question and define the types of sources of 
interest. The search returns a list of eighteen possible sources at five 
different sites ranked by relevance to her selection criteria. She is most 
interested in the third and fourth items on the list and asks for additional 
information about them. The search returns the full text of the finding
aid and a database with all of the data elements in each file. She
searches these to discover that only one of the data files breaks down
the data by gender. She then looks at other attributes and discovers
that this source is a complete registry of Gulf War veterans who have
been treated for Gulf War Syndrome. She can download a public use
version of the file which includes data on each case but does not include
personal identifiers. She requests the file and four minutes later, it
resides on her hard drive. The initial search also listed the address of an
e-mail discussion group of women afflicted with Gulf War Syndrome with
instructions about how to access the archive. She had not considered a
source like this, but now decides to use it as well to analyze how women
are responding to Gulf War Syndrome.


The archive is keeping detailed statistics on requests and use of its 
collection. They notice that those sources which can be downloaded
directly by users are almost fifty times more likely to be used than those 
that have to be ordered and shipped using off-line media. They also use 
the statistics on requests to decide which types of sources to pursue. 
They have noticed a fifteen-fold increase in requests since they started 
the remote access service, but since most of these requests are self-serve,
the demand for technical services has actually declined. The reference
staff is very busy answering e-mail and helping users interpret their data. 
The head of the archives uses these statistics, along with several letters 
from requesters praising the service, to make the case to his Board that 
this is a valuable service. He secures an increase in funding that will be 
used to hire more reference staff and put more collections on-line. 


This is the future that we should strive for in electronic archives. In 
order to achieve this vision, we will need to enhance and link access systems 
so that electronic records are widely known or easily discoverable through the 
access systems that requesters normally use when seeking archival materials. 
We will have to develop the means to deliver materials as seamlessly as 
possible, with the minimum restrictions on reuse, and at little or no cost. The 
objects that are delivered will be useful only if they are accompanied by or can 
be linked to rich resources of descriptive and contextual information. This 
contextual information will help end users assess the quality, reliability, and 
relevance of the documents to their problem or question. Pointers will help 
them find similar or related materials if they wish to delve further into the 
electronic archive or find relevant print sources. But we should not expect to 
build all of the selection and evaluation capabilities into the archive itself. We 
must also educate users to become discriminating consumers of archival 
materials and critical readers of electronic evidence. 


Notes 


2 The issue of physical custody of electronic records is the subject of an ongoing debate
among archivists. For recent discussions of this question see David Bearman, "An Indefensible
Bastion: Archives As a Repository in the Electronic Age," in Archival Management of
Electronic Records, David Bearman, ed., Archives and Museum Informatics Technical Report,
No. 13 (1991), 14-24; Kenneth Thibodeau, "To Be or Not To Be: Archives for Electronic
Records," in Archival Management of Electronic Records, 1-13; Margaret Hedstrom,
"Archives: To Be or Not To Be? A Commentary," in Archival Management of Electronic
Records, 25-30; Terry Cook, "Leaving Archival Electronic Records in Institutions: Policy and
Monitoring Arrangements at the National Archives of Canada," Archives and Museum
Informatics 9:2 (1995), 141-49; and the November 1996 issue of Archives and Manuscripts
(vol. 24, no. 2), the journal of the Australian Society of Archivists, which was devoted to the
issue of custody and post-custodial archives.


3 The authoritative web site for information about the EAD is
<http://lcweb2.loc.gov/ammem/ead/>.


4 For information about the Dublin Core and the subsequent Warwick Framework, see
<www.oclc.org:5046/conferences/metadata/dublin_core_report.html> and Lorcan Dempsey
and Stuart Weibel, "The Warwick Metadata Workshop: A Framework for the Deployment of
Resource Description," D-Lib Magazine (July/August 1996):
<www.dlib.org/dlib/july96/07weibel.html>.


5 Alex Haley, Roots, Garden City, N.Y.: Doubleday, 1976.


6 One example of this type of resource is the Comprehensive Epidemiologic Data Resource
developed by the U.S. Department of Energy to provide public access to health and
exposure data at DOE installations. See <http://cedr.lbl.gov/>.

7 Margaret Hedstrom, "Electronic Archives: Integrity and Access in the Network
Environment," in Networking in the Humanities: Proceedings of the Second Conference held
at Elvetham Hall, Hampshire, UK, 13-16 April 1994, Kent: Bowker-Saur, 1995: 77-95.

8 David Bearman and Ken Sochats, "Metadata Requirements for Evidence" (1995).
<www.lis.pitt.edu/~nhprc.html>.


9 [...] (Winter 1991-92), 6-24.

10 A good example of [...] Age," [...].

11 American Library Association [...], "Joint Statement on Access [...]," American
Archivist 42:4 (Fall 1979).


